summary refs log tree commit diff
path: root/src/etc/unicode.py
AgeCommit message (Collapse)AuthorLines
2015-06-06Correctly map upper-case Sigma to lower-case in word-final position. Fix #26035.Simon Sapin-1/+2
2015-06-06Add char::to_titlecaseSimon Sapin-15/+27
But not str::to_titlecase which would require UAX#29 Unicode Text Segmentation which we decided not to include in of `std`: https://github.com/rust-lang/rfcs/pull/1054
2015-06-06Add complex (but unconditional) Unicode case mapping. Fix #25800Simon Sapin-10/+44
As a result, the iterator returned by `char::to_uppercase` sometimes yields two or three `char`s instead of just one.
2015-06-06to_lowercase/to_uppercase: also map chars not in Lu/Ll categories.Simon Sapin-18/+18
This adds 120 mappings: Dž dž Dž DŽ Lj lj Lj LJ Nj nj Nj NJ Dz dz Dz DZ Ι ᾈ ᾀ ᾉ ᾁ ᾊ ᾂ ᾋ ᾃ ᾌ ᾄ ᾍ ᾅ ᾎ ᾆ ᾏ ᾇ ᾘ ᾐ ᾙ ᾑ ᾚ ᾒ ᾛ ᾓ ᾜ ᾔ ᾝ ᾕ ᾞ ᾖ ᾟ ᾗ ᾨ ᾠ ᾩ ᾡ ᾪ ᾢ ᾫ ᾣ ᾬ ᾤ ᾭ ᾥ ᾮ ᾦ ᾯ ᾧ ᾼ ᾳ ῌ ῃ ῼ ῳ Ⅰ ⅰ Ⅱ ⅱ Ⅲ ⅲ Ⅳ ⅳ Ⅴ ⅴ Ⅵ ⅵ Ⅶ ⅶ Ⅷ ⅷ Ⅸ ⅸ Ⅹ ⅹ Ⅺ ⅺ Ⅻ ⅻ Ⅼ ⅼ Ⅽ ⅽ Ⅾ ⅾ Ⅿ ⅿ ⅰ Ⅰ ⅱ Ⅱ ⅲ Ⅲ ⅳ Ⅳ ⅴ Ⅴ ⅵ Ⅵ ⅶ Ⅶ ⅷ Ⅷ ⅸ Ⅸ ⅹ Ⅹ ⅺ Ⅺ ⅻ Ⅻ ⅼ Ⅼ ⅽ Ⅽ ⅾ Ⅾ ⅿ Ⅿ Ⓐ ⓐ Ⓑ ⓑ Ⓒ ⓒ Ⓓ ⓓ Ⓔ ⓔ Ⓕ ⓕ Ⓖ ⓖ Ⓗ ⓗ Ⓘ ⓘ Ⓙ ⓙ Ⓚ ⓚ Ⓛ ⓛ Ⓜ ⓜ Ⓝ ⓝ Ⓞ ⓞ Ⓟ ⓟ Ⓠ ⓠ Ⓡ ⓡ Ⓢ ⓢ Ⓣ ⓣ Ⓤ ⓤ Ⓥ ⓥ Ⓦ ⓦ Ⓧ ⓧ Ⓨ ⓨ Ⓩ ⓩ ⓐ Ⓐ ⓑ Ⓑ ⓒ Ⓒ ⓓ Ⓓ ⓔ Ⓔ ⓕ Ⓕ ⓖ Ⓖ ⓗ Ⓗ ⓘ Ⓘ ⓙ Ⓙ ⓚ Ⓚ ⓛ Ⓛ ⓜ Ⓜ ⓝ Ⓝ ⓞ Ⓞ ⓟ Ⓟ ⓠ Ⓠ ⓡ Ⓡ ⓢ Ⓢ ⓣ Ⓣ ⓤ Ⓤ ⓥ Ⓥ ⓦ Ⓦ ⓧ Ⓧ ⓨ Ⓨ ⓩ Ⓩ
2015-04-18optimize Unicode tableskwantam-3/+8
Apply optimization described in https://github.com/rust-lang/regex/pull/73#issuecomment-93777126 to rust's copy of `unicode.py`. This shrinks librustc_unicode's tables.rs from 479kB to 456kB, and should improve performance slightly for related operations (e.g., is_alphabetic(), is_xid_start(), etc). In addition, pull in fix from @dscorbett's commit d25c39f86568a147f9b7080c25711fb1f98f056a in regex, which makes `load_properties()` more tolerant of whitespace in the Unicode tables. (This fix does not result in any changes to tables.rs, but could if the Unicode tables change in the future.)
2015-04-16deprecate Unicode functions that will be moved to crates.iokwantam-4/+7
This patch 1. renames libunicode to librustc_unicode, 2. deprecates several pieces of libunicode (see below), and 3. removes references to deprecated functions from librustc_driver and libsyntax. This may change pretty-printed output from these modules in cases involving wide or combining characters used in filenames, identifiers, etc. The following functions are marked deprecated: 1. char.width() and str.width(): --> use unicode-width crate 2. str.graphemes() and str.grapheme_indices(): --> use unicode-segmentation crate 3. str.nfd_chars(), str.nfkd_chars(), str.nfc_chars(), str.nfkc_chars(), char.compose(), char.decompose_canonical(), char.decompose_compatible(), char.canonical_combining_class(): --> use unicode-normalization crate
2015-04-13Remove regex module from libunicodeChris Wong-44/+7
The regex crate keeps its own tables now (rust-lang/regex#41) so we don't need them here. [breaking-change]
2015-04-06use normative source for Grapheme class datakwantam-57/+23
@mahkoh points out in #15628 that unicode.py does not use normative data for Grapheme classes. This pr fixes that issue. In addition, GC_RegionalIndicator is renamed GC_Regional_Indicator in order to stay in line with the Unicode class name definitions. I have updated refs in u_str.rs, and verified that there are no refs elsewhere in the codebase. However, in principle someone using the unicode tables for their own purposes might see breakage from this.
2015-03-03unicode: Properly parse ranges in UnicodeData.txtFlorian Zeitz-12/+21
This handles the ranges contained in UnicodeData.txt. Counterintuitively this actually makes the tables shorter.
2015-03-02Use `const`s instead of `static`s where appropriateFlorian Zeitz-25/+24
This changes the type of some public constants/statics in libunicode. Notably some `&'static &'static [(char, char)]` have changed to `&'static [(char, char)]`. The regexp crate seems to be the sole user of these, yet this is technically a [breaking-change]
2015-02-15Audit integer types in libunicode, libcore/(char, str) and libstd/asciiVadim Petrochenkov-5/+5
2015-01-25cleanup: s/impl Copy/#[derive(Copy)]/gJorge Aparicio-3/+1
2015-01-17s/deriving/derives in Comments/DocsEarl St Sauver-1/+1
There are a large number of places that incorrectly refer to deriving in comments, instead of derives. Fixes #20984
2014-12-14std: Collapse SlicePrelude traitsAlex Crichton-5/+5
This commit collapses the various prelude traits for slices into just one trait: * SlicePrelude/SliceAllocPrelude => SliceExt * CloneSlicePrelude/CloneSliceAllocPrelude => CloneSliceExt * OrdSlicePrelude/OrdSliceAllocPrelude => OrdSliceExt * PartialEqSlicePrelude => PartialEqSliceExt
2014-12-13Get rid of all the remaining uses of `refN`/`valN`/`mutN`/`TupleN`Jorge Aparicio-3/+2
2014-12-11Register new snapshotsAlex Crichton-14/+13
2014-12-05Utilize fewer reexportsCorey Farwell-7/+9
In regards to: https://github.com/rust-lang/rust/issues/19253#issuecomment-64836729 This commit: * Changes the #deriving code so that it generates code that utilizes fewer reexports (in particur Option::* and Result::*), which is necessary to remove those reexports in the future * Changes other areas of the codebase so that fewer reexports are utilized
2014-11-17Switch to purely namespaced enumsSteven Fackler-0/+1
This breaks code that referred to variant names in the same namespace as their enum. Reexport the variants in the old location or alter code to refer to the new locations: ``` pub enum Foo { A, B } fn main() { let a = A; } ``` => ``` pub use self::Foo::{A, B}; pub enum Foo { A, B } fn main() { let a = A; } ``` or ``` pub enum Foo { A, B } fn main() { let a = Foo::A; } ``` [breaking-change]
2014-11-06rollup merge of #18656 : thiagopnts/rename-deprecated-non_uppercase_staticsAlex Crichton-1/+1
2014-11-06Prelude: rename and consolidate extension traitsAaron Turon-5/+5
This commit renames a number of extension traits for slices and string slices, now that they have been refactored for DST. In many cases, multiple extension traits could now be consolidated. Further consolidation will be possible with generalized where clauses. The renamings are consistent with the [new `-Prelude` suffix](https://github.com/rust-lang/rfcs/pull/344). There are probably a few more candidates for being renamed this way, but that is left for API stabilization of the relevant modules. Because this renames traits, it is a: [breaking-change] However, I do not expect any code that currently uses the standard library to actually break. Closes #17917
2014-11-05rename deprecated non_uppercase_statics to non_upper_case_globalsthiagopnts-1/+1
2014-11-04libsyntax: Forbid escapes in the inclusive range `\x80`-`\xff` inPatrick Walton-1/+1
Unicode characters and strings. Use `\u0080`-`\u00ff` instead. ASCII/byte literals are unaffected. This PR introduces a new function, `escape_default`, into the ASCII module. This was necessary for the pretty printer to continue to function. RFC #326. Closes #18062. [breaking-change]
2014-11-01Replace deprecated missing_doc attribute.Joseph Crail-1/+1
2014-10-13Include the Unicode version used to generate `src/libunicode/tables.rs`.Simon Sapin-0/+9
2014-10-09unicode: Make statics legalAlex Crichton-4/+4
The tables in libunicode are far too large to want to be inlined into any other program, so these tables are all going to remain `static`. For them to be legal, they cannot reference one another by value, but instead use references now. This commit also modifies the src/etc/unicode.py script to generate the right tables.
2014-08-30Unify non-snake-case lints and non-uppercase statics lintsP1start-1/+1
This unifies the `non_snake_case_functions` and `uppercase_variables` lints into one lint, `non_snake_case`. It also now checks for non-snake-case modules. This also extends the non-camel-case types lint to check type parameters, and merges the `non_uppercase_pattern_statics` lint into the `non_uppercase_statics` lint. Because the `uppercase_variables` lint is now part of the `non_snake_case` lint, all non-snake-case variables that start with lowercase characters (such as `fooBar`) will now trigger the `non_snake_case` lint. New code should be updated to use the new `non_snake_case` lint instead of the previous `non_snake_case_functions` and `uppercase_variables` lints. All use of the `non_uppercase_pattern_statics` should be replaced with the `non_uppercase_statics` lint. Any code that previously contained non-snake-case module or variable names should be updated to use snake case names or disable the `non_snake_case` lint. Any code with non-camel-case type parameters should be changed to use camel case or disable the `non_camel_case_types` lint. [breaking-change]
2014-08-13core: Add binary_search and binary_search_elem methods to slices.Brian Anderson-21/+25
These are like the existing bsearch methods but if the search fails, it returns the next insertion point. The new `binary_search` returns a `BinarySearchResult` that is either `Found` or `NotFound`. For convenience, the `found` and `not_found` methods convert to `Option`, ala `Result`. Deprecate bsearch and bsearch_elem.
2014-07-28collections, unicode: Add support for NFC and NFKCFlorian Zeitz-2/+33
2014-07-14add Graphemes iterator; tidy unicode exportskwantam-5/+124
- Graphemes and GraphemeIndices structs implement iterators over grapheme clusters analogous to the Chars and CharOffsets for chars in a string. Iterator and DoubleEndedIterator are available for both. - tidied up the exports for libunicode. crate root exports are now moved into more appropriate module locations: - UnicodeStrSlice, Words, Graphemes, GraphemeIndices are in str module - UnicodeChar exported from char instead of crate root - canonical_combining_class is exported from str rather than crate root Since libunicode's exports have changed, programs that previously relied on the old export locations will need to change their `use` statements to reflect the new ones. See above for more information on where the new exports live. closes #7043 [breaking-change]
2014-07-07Add libunicode; move unicode functions from corekwantam-285/+351
- created new crate, libunicode, below libstd - split Char trait into Char (libcore) and UnicodeChar (libunicode) - Unicode-aware functions now live in libunicode - is_alphabetic, is_XID_start, is_XID_continue, is_lowercase, is_uppercase, is_whitespace, is_alphanumeric, is_control, is_digit, to_uppercase, to_lowercase - added width method in UnicodeChar trait - determines printed width of character in columns, or None if it is a non-NULL control character - takes a boolean argument indicating whether the present context is CJK or not (characters with 'A'mbiguous widths are double-wide in CJK contexts, single-wide otherwise) - split StrSlice into StrSlice (libcore) and UnicodeStrSlice (libunicode) - functionality formerly in StrSlice that relied upon Unicode functionality from Char is now in UnicodeStrSlice - words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right - also moved Words type alias into libunicode because words method is in UnicodeStrSlice - unified Unicode tables from libcollections, libcore, and libregex into libunicode - updated unicode.py in src/etc to generate aforementioned tables - generated new tables based on latest Unicode data - added UnicodeChar and UnicodeStrSlice traits to prelude - libunicode is now the collection point for the std::char module, combining the libunicode functionality with the Char functionality from libcore - thus, moved doc comment for char from core::char to unicode::char - libcollections remains the collection point for std::str The Unicode-aware functions that previously lived in the Char and StrSlice traits are no longer available to programs that only use libcore. To regain use of these methods, include the libunicode crate and use the UnicodeChar and/or UnicodeStrSlice traits: extern crate unicode; use unicode::UnicodeChar; use unicode::UnicodeStrSlice; use unicode::Words; // if you want to use the words() method NOTE: this does *not* impact programs that use libstd, since UnicodeChar and UnicodeStrSlice have been added to the prelude. closes #15224 [breaking-change]
2014-05-13std: Rename str::Normalizations to str::DecompositionsFlorian Zeitz-6/+6
The Normalizations iterator has been renamed to Decompositions. It does not currently include all forms of Unicode normalization, but only encompasses decompositions. If implemented recomposition would likely be a separate iterator which works on the result of this one. [breaking-change]
2014-05-13core: Move Hangul decomposition into unicode.rsFlorian Zeitz-19/+58
2014-05-13std, core: Generate unicode.rs using unicode.pyFlorian Zeitz-55/+76
2014-04-14Use new attribute syntax in python files in src/etc too (#13478)Manish Goregaokar-2/+2
2014-03-20rename std::vec -> std::sliceDaniel Micay-3/+3
Closes #12702
2014-03-13Remove code duplicationPiotr Zolnierek-27/+19
Remove whitespace Update documentation for to_uppercase, to_lowercase
2014-03-13Implement lower, upper case conversion for charPiotr Zolnierek-29/+74
2014-03-13std::unicode: remove unused category tablesPiotr Zolnierek-1/+4
2014-02-05etc: add missing license boilerplatesAdrien Tétar-1/+10
2013-11-27Fix handling of upper/lowercase, and whitespaceFlorian Zeitz-10/+12
2013-11-27Update unicode.py to reflect language changesFlorian Zeitz-5/+5
2013-09-09rename `std::iterator` to `std::iter`Daniel Micay-1/+1
The trait will keep the `Iterator` naming, but a more concise module name makes using the free functions less verbose. The module will define iterables in addition to iterators, as it deals with iteration in general.
2013-09-04stop treating char as an integer typeDaniel Micay-0/+1
Closes #7609
2013-08-21Add canonical combining class to std::unicodeFlorian Zeitz-4/+53
2013-08-21Add Unicode decomposition mappings to std::unicodeFlorian Zeitz-31/+99
2013-07-01rustc: add a lint to enforce uppercase statics.Huon Wilson-0/+1
2013-06-30Convert vec::{bsearch, bsearch_elem} to methods.Huon Wilson-2/+2
2013-06-30etc: update etc/unicode.py for the changes made to std::unicode.Huon Wilson-10/+24
2013-05-02Explain that the source code was generated by this scriptkud1ing-0/+4
2013-04-18core: replace unicode match exprs with bsearch in const arrays, minor perf win.Graydon Hoare-1/+44