path: root/src/etc/unicode.py
Age | Commit message | Author | Lines
2017-05-04 | Move unicode Python script into libstd_unicode crate. | Corey Farwell | -591/+0
The only place this Python script is used is inside the libstd_unicode crate, so let's move it there.
2017-01-03 | Reduce the size of static data in std_unicode::tables. | Simon Sapin | -6/+58
`BoolTrie` works well for sets of code points spread out through most of Unicode’s range, but it uses a lot of space for sets with few, mostly low, code points. This switches a few of its instances to a similar but simpler trie data structure.

## Before

`size_of::<BoolTrie>()` is 1552, which is added to `table.r3.len() * 8 + t.r5.len() + t.r6.len() * 8`:

* `Cc_table`: 1632
* `White_Space_table`: 1656
* `Pattern_White_Space_table`: 1640
* Total: 4928 bytes

## After

`size_of::<SmallBoolTrie>()` is 32, which is added to `t.r1.len() + t.r2.len() * 8`:

* `Cc_table`: 51
* `White_Space_table`: 273
* `Pattern_White_Space_table`: 193
* Total: 517 bytes

## Difference

Every Rust program with `std` statically linked should be about 4 KB smaller.
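The totals quoted in this commit message can be re-derived with a few lines of arithmetic. The sketch below only re-adds the per-table byte counts given above; the dictionary names are illustrative and do not come from the repository.

```python
# Per-table sizes in bytes, copied from the commit message above.
# The variable names are illustrative, not taken from the repository.
before = {
    "Cc_table": 1632,
    "White_Space_table": 1656,
    "Pattern_White_Space_table": 1640,
}
after = {
    "Cc_table": 51,
    "White_Space_table": 273,
    "Pattern_White_Space_table": 193,
}

total_before = sum(before.values())
total_after = sum(after.values())
saved = total_before - total_after

print(f"before: {total_before} bytes")
print(f"after:  {total_after} bytes")
print(f"saved:  {saved} bytes ({saved / 1024:.1f} KiB)")
```

The difference comes out a little over 4 KiB, which matches the "about 4 KB" estimate in the message.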
2017-01-02 | Remove some dead Python code. | Simon Sapin | -7/+0
It was used to measure before/after size in cfaf66c94e29a38cd3264b4a55c85b90213543d9.
2016-09-17 | remove useless semicolon from python | Eitan Adler | -11/+11
2016-09-17 | prefer tuple to array | Eitan Adler | -2/+2
2016-07-01 | Update Unicode tables to 9.0 | Josh Stone | -1/+1
2016-04-20 | Add comment, reduce storage requirements | Raph Levien | -6/+33
Adds a comment which explains the trie structure, and also does a little arithmetic on lookup (no measurable impact, looks like modern CPUs do this arithmetic in parallel with the memory lookup to find the node) to save a bit of space. As a result, the memory impact of the compiled tables is within a couple hundred bytes of the old bsearch-range structure.
2016-04-19 | Fix wrong shift in trie_lookup_range_table | Raph Levien | -1/+1
Somehow got in my head that >> 8 was the right shift for a chunk of 64. Oops, sorry.
2016-04-19 | Efficient trie lookup for boolean Unicode properties | Raph Levien | -4/+107
Replace binary search of ranges with trie lookup using leaves of 64-bit bitmap chunks. Benchmarks suggest this is approximately 10x faster than the bsearch approach.
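A minimal sketch of the idea, assuming a single index level for simplicity: every chunk of 64 consecutive code points is stored as one 64-bit bitmap, identical bitmaps are shared, and a lookup is a shift, an index, and a bit test. The function names `build_trie` and `trie_lookup` and the sample set are hypothetical; the real `BoolTrie` splits the code point range across several levels, which this sketch does not attempt to reproduce.

```python
CHUNK_SIZE = 64  # each leaf is a 64-bit bitmap covering 64 consecutive code points


def build_trie(code_points, limit=0x110000):
    """Return (index, leaves); index[c >> 6] is the position of c's bitmap in leaves."""
    # One bitmap per chunk of 64 code points.
    bitmaps = [0] * (limit // CHUNK_SIZE)
    for c in code_points:
        bitmaps[c >> 6] |= 1 << (c & 63)

    # Share identical bitmaps so that sparse sets need few distinct leaves.
    leaves = []
    leaf_ids = {}
    index = []
    for bitmap in bitmaps:
        if bitmap not in leaf_ids:
            leaf_ids[bitmap] = len(leaves)
            leaves.append(bitmap)
        index.append(leaf_ids[bitmap])
    return index, leaves


def trie_lookup(index, leaves, c):
    """Return True if code point c is a member of the set."""
    chunk = c >> 6  # 64 == 2**6, so the chunk number is c shifted right by 6
    if chunk >= len(index):
        return False
    return (leaves[index[chunk]] >> (c & 63)) & 1 == 1


# A handful of whitespace code points, chosen purely for illustration.
sample = [0x09, 0x0A, 0x20, 0xA0, 0x3000]
index, leaves = build_trie(sample)
print(trie_lookup(index, leaves, 0x20), trie_lookup(index, leaves, ord("a")))
```

Because almost all chunks are empty for a set like this, the `leaves` list stays very short; the index is what dominates the size, which is why the production structure adds further levels.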
2016-01-16 | libsyntax: accept only whitespace with the PATTERN_WHITE_SPACE property | Kevin Butler | -2/+2
This aligns with unicode recommendations and should be stable for all future unicode releases. See http://unicode.org/reports/tr31/#R3. This renames `libsyntax::lexer::is_whitespace` to `is_pattern_whitespace` so potentially breaks users of libsyntax.
2016-01-04 | Improve the range comparison | Andrea Canciani | -3/+3
As mentioned in #29734, the range comparison closure can be improved. The LLVM IR and the assembly from the new version are much simpler and unfortunately we cannot rely on the compiler to optimise this much, as it would need to know that `lo <= hi`. Besides simpler code, there might also be a performance advantage, although it is unlikely to appear on benchmarks, as we are doing a binary search, which should always involve few comparisons. The code is available on the playpen for ease of comparison: http://is.gd/4raMmH
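To illustrate the kind of comparison being discussed, here is a sketch of a binary search over sorted, non-overlapping inclusive ranges that needs at most two comparisons per step. It relies on `lo <= hi` holding for every range, which is the fact the compiler could not infer. The function name `range_contains` and the sample table are assumptions for illustration; the exact closure in `tables.rs` may be written differently.

```python
def range_contains(ranges, c):
    """Binary-search sorted, non-overlapping inclusive (lo, hi) ranges for c."""
    left, right = 0, len(ranges)
    while left < right:
        mid = (left + right) // 2
        lo, hi = ranges[mid]
        if c < lo:
            # c lies before this range: continue in the left half.
            right = mid
        elif c > hi:
            # c lies after this range: continue in the right half.
            left = mid + 1
        else:
            # Neither c < lo nor c > hi, so lo <= c <= hi (given lo <= hi).
            return True
    return False


# Hypothetical table: ASCII uppercase, ASCII lowercase, and one Latin-1 run.
table = [(0x41, 0x5A), (0x61, 0x7A), (0xC0, 0xD6)]
print(range_contains(table, ord("Q")), range_contains(table, ord("5")))
```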
2016-01-04 | Reuse standard methods | Andrea Canciani | -10/+1
Do not hand-code `Result::ok` or `cmp` in tables.rs.
2016-01-04 | Improve formatting of tables.rs | Andrea Canciani | -3/+3
Make unicode.py generate a tables.rs which is more conformant to usual Rust formatting (as per `rustfmt`).
2016-01-04 | Cleanup unicode.py | Andrea Canciani | -115/+0
The methods related to char width are dead code since 464cdff102993ff1900eebbf65209e0a3c0be0d5; remove them.
2015-12-05 | std: Stabilize APIs for the 1.6 release | Alex Crichton | -3/+0
This commit is the standard API stabilization commit for the 1.6 release cycle. The list of issues and APIs below have all been through their cycle-long FCP and the libs team decisions are listed below.

Stabilized APIs

* `Read::read_exact`
* `ErrorKind::UnexpectedEof` (renamed from `UnexpectedEOF`)
* libcore -- this was a bit of a nuanced stabilization, the crate itself is now marked as `#[stable]` and the methods appearing via traits for primitives like `char` and `str` are now also marked as stable. Note that the extension traits themselves are marked as unstable as they're imported via the prelude. The `try!` macro was also moved from the standard library into libcore to have the same interface. Otherwise the functions all have copied stability from the standard library now.
* The `#![no_std]` attribute
* `fs::DirBuilder`
* `fs::DirBuilder::new`
* `fs::DirBuilder::recursive`
* `fs::DirBuilder::create`
* `os::unix::fs::DirBuilderExt`
* `os::unix::fs::DirBuilderExt::mode`
* `vec::Drain`
* `vec::Vec::drain`
* `string::Drain`
* `string::String::drain`
* `vec_deque::Drain`
* `vec_deque::VecDeque::drain`
* `collections::hash_map::Drain`
* `collections::hash_map::HashMap::drain`
* `collections::hash_set::Drain`
* `collections::hash_set::HashSet::drain`
* `collections::binary_heap::Drain`
* `collections::binary_heap::BinaryHeap::drain`
* `Vec::extend_from_slice` (renamed from `push_all`)
* `Mutex::get_mut`
* `Mutex::into_inner`
* `RwLock::get_mut`
* `RwLock::into_inner`
* `Iterator::min_by_key` (renamed from `min_by`)
* `Iterator::max_by_key` (renamed from `max_by`)

Deprecated APIs

* `ErrorKind::UnexpectedEOF` (renamed to `UnexpectedEof`)
* `OsString::from_bytes`
* `OsStr::to_cstring`
* `OsStr::to_bytes`
* `fs::walk_dir` and `fs::WalkDir`
* `path::Components::peek`
* `slice::bytes::MutableByteVector`
* `slice::bytes::copy_memory`
* `Vec::push_all` (renamed to `extend_from_slice`)
* `Duration::span`
* `IpAddr`
* `SocketAddr::ip`
* `Read::tee`
* `io::Tee`
* `Write::broadcast`
* `io::Broadcast`
* `Iterator::min_by` (renamed to `min_by_key`)
* `Iterator::max_by` (renamed to `max_by_key`)
* `net::lookup_addr`

New APIs (still unstable)

* `<[T]>::sort_by_key` (added to mirror `min_by_key`)

Closes #27585
Closes #27704
Closes #27707
Closes #27710
Closes #27711
Closes #27727
Closes #27740
Closes #27744
Closes #27799
Closes #27801
cc #27801 (doesn't close as `Chars` is still unstable)
Closes #28968
2015-10-26 | rustfmt librustc_unicode | Corentin Henry | -6/+11
2015-08-12 | Remove all unstable deprecated functionality | Alex Crichton | -163/+0
This commit removes all unstable and deprecated functions in the standard library. A release was recently cut (1.3) which makes this a good time for some spring cleaning of the deprecated functions.
2015-06-24 | Remove char::to_titlecase. Fix #26555 | Simon Sapin | -10/+0
I added it because it was easy (same as `char::to_lowercase`, just a different table), but it doesn’t make sense to have this in std but not str::to_titlecase, which would require https://github.com/unicode-rs/unicode-segmentation At some point in the future this feature will be available (both on char and str) in a crates.io crate.
2015-06-06 | Correctly map upper-case Sigma to lower-case in word-final position. Fix #26035. | Simon Sapin | -1/+2
2015-06-06 | Add char::to_titlecase | Simon Sapin | -15/+27
But not str::to_titlecase which would require UAX#29 Unicode Text Segmentation which we decided not to include in `std`: https://github.com/rust-lang/rfcs/pull/1054
2015-06-06 | Add complex (but unconditional) Unicode case mapping. Fix #25800 | Simon Sapin | -10/+44
As a result, the iterator returned by `char::to_uppercase` sometimes yields two or three `char`s instead of just one.
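The effect of one-to-many case mapping is easy to observe. The sketch below uses Python's built-in `str.upper()`, which also applies the unconditional multi-character mappings, as a stand-in for Rust's `char::to_uppercase`; it is an analogy, not the Rust API, and the helper name `uppercase_expansion` is hypothetical.

```python
def uppercase_expansion(ch):
    """Return the code points that a single character uppercases to."""
    return [f"U+{ord(u):04X}" for u in ch.upper()]


# "a" maps to one character, U+00DF (sharp s) to two,
# and U+FB03 (the ffi ligature) to three.
for ch in ("a", "\u00df", "\ufb03"):
    print(f"U+{ord(ch):04X} ->", uppercase_expansion(ch))
```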
2015-06-06 | to_lowercase/to_uppercase: also map chars not in Lu/Ll categories. | Simon Sapin | -18/+18
This adds 120 mappings: Dž dž Dž DŽ Lj lj Lj LJ Nj nj Nj NJ Dz dz Dz DZ Ι ᾈ ᾀ ᾉ ᾁ ᾊ ᾂ ᾋ ᾃ ᾌ ᾄ ᾍ ᾅ ᾎ ᾆ ᾏ ᾇ ᾘ ᾐ ᾙ ᾑ ᾚ ᾒ ᾛ ᾓ ᾜ ᾔ ᾝ ᾕ ᾞ ᾖ ᾟ ᾗ ᾨ ᾠ ᾩ ᾡ ᾪ ᾢ ᾫ ᾣ ᾬ ᾤ ᾭ ᾥ ᾮ ᾦ ᾯ ᾧ ᾼ ᾳ ῌ ῃ ῼ ῳ Ⅰ ⅰ Ⅱ ⅱ Ⅲ ⅲ Ⅳ ⅳ Ⅴ ⅴ Ⅵ ⅵ Ⅶ ⅶ Ⅷ ⅷ Ⅸ ⅸ Ⅹ ⅹ Ⅺ ⅺ Ⅻ ⅻ Ⅼ ⅼ Ⅽ ⅽ Ⅾ ⅾ Ⅿ ⅿ ⅰ Ⅰ ⅱ Ⅱ ⅲ Ⅲ ⅳ Ⅳ ⅴ Ⅴ ⅵ Ⅵ ⅶ Ⅶ ⅷ Ⅷ ⅸ Ⅸ ⅹ Ⅹ ⅺ Ⅺ ⅻ Ⅻ ⅼ Ⅼ ⅽ Ⅽ ⅾ Ⅾ ⅿ Ⅿ Ⓐ ⓐ Ⓑ ⓑ Ⓒ ⓒ Ⓓ ⓓ Ⓔ ⓔ Ⓕ ⓕ Ⓖ ⓖ Ⓗ ⓗ Ⓘ ⓘ Ⓙ ⓙ Ⓚ ⓚ Ⓛ ⓛ Ⓜ ⓜ Ⓝ ⓝ Ⓞ ⓞ Ⓟ ⓟ Ⓠ ⓠ Ⓡ ⓡ Ⓢ ⓢ Ⓣ ⓣ Ⓤ ⓤ Ⓥ ⓥ Ⓦ ⓦ Ⓧ ⓧ Ⓨ ⓨ Ⓩ ⓩ ⓐ Ⓐ ⓑ Ⓑ ⓒ Ⓒ ⓓ Ⓓ ⓔ Ⓔ ⓕ Ⓕ ⓖ Ⓖ ⓗ Ⓗ ⓘ Ⓘ ⓙ Ⓙ ⓚ Ⓚ ⓛ Ⓛ ⓜ Ⓜ ⓝ Ⓝ ⓞ Ⓞ ⓟ Ⓟ ⓠ Ⓠ ⓡ Ⓡ ⓢ Ⓢ ⓣ Ⓣ ⓤ Ⓤ ⓥ Ⓥ ⓦ Ⓦ ⓧ Ⓧ ⓨ Ⓨ ⓩ Ⓩ
2015-04-18 | optimize Unicode tables | kwantam | -3/+8
Apply optimization described in https://github.com/rust-lang/regex/pull/73#issuecomment-93777126 to rust's copy of `unicode.py`. This shrinks librustc_unicode's tables.rs from 479kB to 456kB, and should improve performance slightly for related operations (e.g., is_alphabetic(), is_xid_start(), etc). In addition, pull in fix from @dscorbett's commit d25c39f86568a147f9b7080c25711fb1f98f056a in regex, which makes `load_properties()` more tolerant of whitespace in the Unicode tables. (This fix does not result in any changes to tables.rs, but could if the Unicode tables change in the future.)
2015-04-16 | deprecate Unicode functions that will be moved to crates.io | kwantam | -4/+7
This patch

1. renames libunicode to librustc_unicode,
2. deprecates several pieces of libunicode (see below), and
3. removes references to deprecated functions from librustc_driver and libsyntax. This may change pretty-printed output from these modules in cases involving wide or combining characters used in filenames, identifiers, etc.

The following functions are marked deprecated:

1. char.width() and str.width(): --> use unicode-width crate
2. str.graphemes() and str.grapheme_indices(): --> use unicode-segmentation crate
3. str.nfd_chars(), str.nfkd_chars(), str.nfc_chars(), str.nfkc_chars(), char.compose(), char.decompose_canonical(), char.decompose_compatible(), char.canonical_combining_class(): --> use unicode-normalization crate
2015-04-13 | Remove regex module from libunicode | Chris Wong | -44/+7
The regex crate keeps its own tables now (rust-lang/regex#41) so we don't need them here. [breaking-change]
2015-04-06 | use normative source for Grapheme class data | kwantam | -57/+23
@mahkoh points out in #15628 that unicode.py does not use normative data for Grapheme classes. This pr fixes that issue. In addition, GC_RegionalIndicator is renamed GC_Regional_Indicator in order to stay in line with the Unicode class name definitions. I have updated refs in u_str.rs, and verified that there are no refs elsewhere in the codebase. However, in principle someone using the unicode tables for their own purposes might see breakage from this.
2015-03-03 | unicode: Properly parse ranges in UnicodeData.txt | Florian Zeitz | -12/+21
This handles the ranges contained in UnicodeData.txt. Counterintuitively this actually makes the tables shorter.
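For context, UnicodeData.txt describes some large blocks with a pair of lines whose names end in `, First>` and `, Last>` instead of listing every code point. A minimal sketch of handling those pairs follows; the function name `expand_unicode_data` and the abbreviated three-line sample are assumptions for illustration and are not taken from `unicode.py`.

```python
def expand_unicode_data(lines):
    """Yield (first, last, general_category) for each entry or First/Last pair."""
    range_start = None
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        code = int(fields[0], 16)
        name, category = fields[1], fields[2]
        if name.endswith(", First>"):
            # Remember where the range begins; the next line closes it.
            range_start = code
        elif name.endswith(", Last>"):
            yield (range_start, code, category)
            range_start = None
        else:
            yield (code, code, category)


# Abbreviated sample in UnicodeData.txt's semicolon-separated layout.
SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;",
    "4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;",
]

for first, last, category in expand_unicode_data(SAMPLE):
    print(f"{first:04X}..{last:04X} {category}")
```

Treating the pair as one range yields a single table entry for the whole block rather than two isolated code points.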
2015-03-02 | Use `const`s instead of `static`s where appropriate | Florian Zeitz | -25/+24
This changes the type of some public constants/statics in libunicode. Notably some `&'static &'static [(char, char)]` have changed to `&'static [(char, char)]`. The regexp crate seems to be the sole user of these, yet this is technically a [breaking-change]
2015-02-15 | Audit integer types in libunicode, libcore/(char, str) and libstd/ascii | Vadim Petrochenkov | -5/+5
2015-01-25 | cleanup: s/impl Copy/#[derive(Copy)]/g | Jorge Aparicio | -3/+1
2015-01-17 | s/deriving/derives in Comments/Docs | Earl St Sauver | -1/+1
There are a large number of places that incorrectly refer to deriving in comments, instead of derives. Fixes #20984
2014-12-14 | std: Collapse SlicePrelude traits | Alex Crichton | -5/+5
This commit collapses the various prelude traits for slices into just one trait:

* SlicePrelude/SliceAllocPrelude => SliceExt
* CloneSlicePrelude/CloneSliceAllocPrelude => CloneSliceExt
* OrdSlicePrelude/OrdSliceAllocPrelude => OrdSliceExt
* PartialEqSlicePrelude => PartialEqSliceExt
2014-12-13 | Get rid of all the remaining uses of `refN`/`valN`/`mutN`/`TupleN` | Jorge Aparicio | -3/+2
2014-12-11 | Register new snapshots | Alex Crichton | -14/+13
2014-12-05 | Utilize fewer reexports | Corey Farwell | -7/+9
In regards to: https://github.com/rust-lang/rust/issues/19253#issuecomment-64836729

This commit:

* Changes the #deriving code so that it generates code that utilizes fewer reexports (in particular Option::* and Result::*), which is necessary to remove those reexports in the future
* Changes other areas of the codebase so that fewer reexports are utilized
2014-11-17 | Switch to purely namespaced enums | Steven Fackler | -0/+1
This breaks code that referred to variant names in the same namespace as their enum. Reexport the variants in the old location or alter code to refer to the new locations:

```
pub enum Foo {
    A,
    B
}

fn main() {
    let a = A;
}
```
=>
```
pub use self::Foo::{A, B};

pub enum Foo {
    A,
    B
}

fn main() {
    let a = A;
}
```
or
```
pub enum Foo {
    A,
    B
}

fn main() {
    let a = Foo::A;
}
```

[breaking-change]
2014-11-06 | rollup merge of #18656 : thiagopnts/rename-deprecated-non_uppercase_statics | Alex Crichton | -1/+1
2014-11-06 | Prelude: rename and consolidate extension traits | Aaron Turon | -5/+5
This commit renames a number of extension traits for slices and string slices, now that they have been refactored for DST. In many cases, multiple extension traits could now be consolidated. Further consolidation will be possible with generalized where clauses. The renamings are consistent with the [new `-Prelude` suffix](https://github.com/rust-lang/rfcs/pull/344). There are probably a few more candidates for being renamed this way, but that is left for API stabilization of the relevant modules. Because this renames traits, it is a: [breaking-change] However, I do not expect any code that currently uses the standard library to actually break. Closes #17917
2014-11-05 | rename deprecated non_uppercase_statics to non_upper_case_globals | thiagopnts | -1/+1
2014-11-04 | libsyntax: Forbid escapes in the inclusive range `\x80`-`\xff` in | Patrick Walton | -1/+1
Unicode characters and strings. Use `\u0080`-`\u00ff` instead. ASCII/byte literals are unaffected. This PR introduces a new function, `escape_default`, into the ASCII module. This was necessary for the pretty printer to continue to function. RFC #326. Closes #18062. [breaking-change]
2014-11-01 | Replace deprecated missing_doc attribute. | Joseph Crail | -1/+1
2014-10-13 | Include the Unicode version used to generate `src/libunicode/tables.rs`. | Simon Sapin | -0/+9
2014-10-09 | unicode: Make statics legal | Alex Crichton | -4/+4
The tables in libunicode are far too large to want to be inlined into any other program, so these tables are all going to remain `static`. For them to be legal, they cannot reference one another by value, but instead use references now. This commit also modifies the src/etc/unicode.py script to generate the right tables.
2014-08-30 | Unify non-snake-case lints and non-uppercase statics lints | P1start | -1/+1
This unifies the `non_snake_case_functions` and `uppercase_variables` lints into one lint, `non_snake_case`. It also now checks for non-snake-case modules. This also extends the non-camel-case types lint to check type parameters, and merges the `non_uppercase_pattern_statics` lint into the `non_uppercase_statics` lint. Because the `uppercase_variables` lint is now part of the `non_snake_case` lint, all non-snake-case variables that start with lowercase characters (such as `fooBar`) will now trigger the `non_snake_case` lint. New code should be updated to use the new `non_snake_case` lint instead of the previous `non_snake_case_functions` and `uppercase_variables` lints. All use of the `non_uppercase_pattern_statics` should be replaced with the `non_uppercase_statics` lint. Any code that previously contained non-snake-case module or variable names should be updated to use snake case names or disable the `non_snake_case` lint. Any code with non-camel-case type parameters should be changed to use camel case or disable the `non_camel_case_types` lint. [breaking-change]
2014-08-13 | core: Add binary_search and binary_search_elem methods to slices. | Brian Anderson | -21/+25
These are like the existing bsearch methods but if the search fails, it returns the next insertion point. The new `binary_search` returns a `BinarySearchResult` that is either `Found` or `NotFound`. For convenience, the `found` and `not_found` methods convert to `Option`, ala `Result`. Deprecate bsearch and bsearch_elem.
2014-07-28 | collections, unicode: Add support for NFC and NFKC | Florian Zeitz | -2/+33
2014-07-14 | add Graphemes iterator; tidy unicode exports | kwantam | -5/+124
- Graphemes and GraphemeIndices structs implement iterators over grapheme clusters analogous to the Chars and CharOffsets for chars in a string. Iterator and DoubleEndedIterator are available for both.
- tidied up the exports for libunicode. crate root exports are now moved into more appropriate module locations:
  - UnicodeStrSlice, Words, Graphemes, GraphemeIndices are in str module
  - UnicodeChar exported from char instead of crate root
  - canonical_combining_class is exported from str rather than crate root

Since libunicode's exports have changed, programs that previously relied on the old export locations will need to change their `use` statements to reflect the new ones. See above for more information on where the new exports live.

closes #7043

[breaking-change]
2014-07-07 | Add libunicode; move unicode functions from core | kwantam | -285/+351
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
  - Unicode-aware functions now live in libunicode
    - is_alphabetic, is_XID_start, is_XID_continue, is_lowercase, is_uppercase, is_whitespace, is_alphanumeric, is_control, is_digit, to_uppercase, to_lowercase
  - added width method in UnicodeChar trait
    - determines printed width of character in columns, or None if it is a non-NULL control character
    - takes a boolean argument indicating whether the present context is CJK or not (characters with 'A'mbiguous widths are double-wide in CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice (libunicode)
  - functionality formerly in StrSlice that relied upon Unicode functionality from Char is now in UnicodeStrSlice
    - words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
  - also moved Words type alias into libunicode because words method is in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module, combining the libunicode functionality with the Char functionality from libcore
  - thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str

The Unicode-aware functions that previously lived in the Char and StrSlice traits are no longer available to programs that only use libcore. To regain use of these methods, include the libunicode crate and use the UnicodeChar and/or UnicodeStrSlice traits:

    extern crate unicode;
    use unicode::UnicodeChar;
    use unicode::UnicodeStrSlice;
    use unicode::Words; // if you want to use the words() method

NOTE: this does *not* impact programs that use libstd, since UnicodeChar and UnicodeStrSlice have been added to the prelude.

closes #15224

[breaking-change]
2014-05-13 | std: Rename str::Normalizations to str::Decompositions | Florian Zeitz | -6/+6
The Normalizations iterator has been renamed to Decompositions. It does not currently include all forms of Unicode normalization, but only encompasses decompositions. If implemented, recomposition would likely be a separate iterator which works on the result of this one. [breaking-change]
2014-05-13 | core: Move Hangul decomposition into unicode.rs | Florian Zeitz | -19/+58
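Hangul decomposition is usually done arithmetically rather than with a table. The sketch below follows the standard Unicode algorithm for splitting a precomposed Hangul syllable into its leading consonant, vowel, and optional trailing consonant; the constant and function names are illustrative and are not copied from `unicode.rs`. The result is cross-checked against Python's `unicodedata.normalize`.

```python
import unicodedata

# Constants from the standard Unicode Hangul syllable decomposition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT  # 11172 precomposed syllables in total


def decompose_hangul(ch):
    """Decompose one precomposed Hangul syllable into its jamo; pass others through."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch  # not a precomposed Hangul syllable
    leading = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    trailing = T_BASE + s_index % T_COUNT
    jamo = [chr(leading), chr(vowel)]
    if trailing != T_BASE:  # T_BASE itself means "no trailing consonant"
        jamo.append(chr(trailing))
    return "".join(jamo)


syllable = "\ud55c"  # U+D55C HANGUL SYLLABLE HAN
print([f"U+{ord(j):04X}" for j in decompose_hangul(syllable)])
print(decompose_hangul(syllable) == unicodedata.normalize("NFD", syllable))
```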