rust/src/etc/unicode.py, branch 1.3.0

https://github.com/rust-lang/rust http://git.dreamy.place/mirrors/rust/atom?h=1.3.0 2015-06-25T05:16:25+00:00

Remove char::to_titlecase. Fix #26555
2015-06-25T05:16:25+00:00 Simon Sapin simon.sapin@exyr.org 2015-06-25T05:14:27+00:00 urn:sha1:32b7b50bafd0d40fda1caa93cf7f7068cc5052e3
I added it because it was easy (same as `char::to_lowercase`, just a different table), but it doesn’t make sense to have this in std but not str::to_titlecase, which would require https://github.com/unicode-rs/unicode-segmentation
At some point in the future this feature will be available (both on char and str) in a crates.io crate.

Correctly map upper-case Sigma to lower-case in word-final position. Fix #26035.
2015-06-06T10:37:11+00:00 Simon Sapin simon.sapin@exyr.org 2015-06-06T10:34:24+00:00 urn:sha1:f901086b0db092f331b9555199298c58d685f668

Add char::to_titlecase
2015-06-06T10:37:11+00:00 Simon Sapin simon.sapin@exyr.org 2015-06-05T17:20:09+00:00 urn:sha1:d316487ec1870956b5ba13468c39b61577e6858f
But not str::to_titlecase, which would require UAX#29 Unicode Text Segmentation, which we decided not to include in `std`: https://github.com/rust-lang/rfcs/pull/1054

Add complex (but unconditional) Unicode case mapping. Fix #25800
2015-06-06T10:37:10+00:00 Simon Sapin simon.sapin@exyr.org 2015-06-05T15:40:09+00:00 urn:sha1:addaa5b1ff0d611b6568ce5fb0c6469a8e1a6ee4
As a result, the iterator returned by `char::to_uppercase` sometimes yields two or three `char`s instead of just one.

to_lowercase/to_uppercase: also map chars not in Lu/Ll categories.
2015-06-06T10:37:10+00:00 Simon Sapin simon.sapin@exyr.org 2015-06-05T14:23:51+00:00 urn:sha1:66af12721a3200f872adf38e0015e22db88cd86e
This adds 120 mappings:
ǅ ǆ ǅ Ǆ ǈ ǉ ǈ Ǉ ǋ ǌ ǋ Ǌ ǲ ǳ ǲ Ǳ Ι ᾈ ᾀ ᾉ ᾁ ᾊ ᾂ ᾋ ᾃ ᾌ ᾄ ᾍ ᾅ ᾎ ᾆ ᾏ ᾇ ᾘ ᾐ ᾙ ᾑ ᾚ ᾒ ᾛ ᾓ ᾜ ᾔ ᾝ ᾕ ᾞ ᾖ ᾟ ᾗ ᾨ ᾠ ᾩ ᾡ ᾪ ᾢ ᾫ ᾣ ᾬ ᾤ ᾭ ᾥ ᾮ ᾦ ᾯ ᾧ ᾼ ᾳ ῌ ῃ ῼ ῳ
Ⅰ ⅰ Ⅱ ⅱ Ⅲ ⅲ Ⅳ ⅳ Ⅴ ⅴ Ⅵ ⅵ Ⅶ ⅶ Ⅷ ⅷ Ⅸ ⅸ Ⅹ ⅹ Ⅺ ⅺ Ⅻ ⅻ Ⅼ ⅼ Ⅽ ⅽ Ⅾ ⅾ Ⅿ ⅿ
ⅰ Ⅰ ⅱ Ⅱ ⅲ Ⅲ ⅳ Ⅳ ⅴ Ⅴ ⅵ Ⅵ ⅶ Ⅶ ⅷ Ⅷ ⅸ Ⅸ ⅹ Ⅹ ⅺ Ⅺ ⅻ Ⅻ ⅼ Ⅼ ⅽ Ⅽ ⅾ Ⅾ ⅿ Ⅿ
Ⓐ ⓐ Ⓑ ⓑ Ⓒ ⓒ Ⓓ ⓓ Ⓔ ⓔ Ⓕ ⓕ Ⓖ ⓖ Ⓗ ⓗ Ⓘ ⓘ Ⓙ ⓙ Ⓚ ⓚ Ⓛ ⓛ Ⓜ ⓜ Ⓝ ⓝ Ⓞ ⓞ Ⓟ ⓟ Ⓠ ⓠ Ⓡ ⓡ Ⓢ ⓢ Ⓣ ⓣ Ⓤ ⓤ Ⓥ ⓥ Ⓦ ⓦ Ⓧ ⓧ Ⓨ ⓨ Ⓩ ⓩ
ⓐ Ⓐ ⓑ Ⓑ ⓒ Ⓒ ⓓ Ⓓ ⓔ Ⓔ ⓕ Ⓕ ⓖ Ⓖ ⓗ Ⓗ ⓘ Ⓘ ⓙ Ⓙ ⓚ Ⓚ ⓛ Ⓛ ⓜ Ⓜ ⓝ Ⓝ ⓞ Ⓞ ⓟ Ⓟ ⓠ Ⓠ ⓡ Ⓡ ⓢ Ⓢ ⓣ Ⓣ ⓤ Ⓤ ⓥ Ⓥ ⓦ Ⓦ ⓧ Ⓧ ⓨ Ⓨ ⓩ Ⓩ

optimize Unicode tables
2015-04-18T17:20:57+00:00 kwantam kwantam@gmail.com 2015-04-16T19:38:35+00:00 urn:sha1:f14d289d71fd8e4956e7214bda3af15cd50898fe
Apply optimization described in https://github.com/rust-lang/regex/pull/73#issuecomment-93777126 to Rust's copy of `unicode.py`. This shrinks librustc_unicode's tables.rs from 479kB to 456kB, and should improve performance slightly for related operations (e.g., is_alphabetic(), is_xid_start(), etc).
In addition, pull in fix from @dscorbett's commit d25c39f86568a147f9b7080c25711fb1f98f056a in regex, which makes `load_properties()` more tolerant of whitespace in the Unicode tables. (This fix does not result in any changes to tables.rs, but could if the Unicode tables change in the future.)

deprecate Unicode functions that will be moved to crates.io
2015-04-16T21:03:05+00:00 kwantam kwantam@gmail.com 2015-04-14T19:52:37+00:00 urn:sha1:29d1252e4d2126318d7f622505ed76dd1e8e4edc
This patch
1. renames libunicode to librustc_unicode,
2. deprecates several pieces of libunicode (see below), and
3. removes references to deprecated functions from librustc_driver and libsyntax. This may change pretty-printed output from these modules in cases involving wide or combining characters used in filenames, identifiers, etc.
The following functions are marked deprecated:
1. char.width() and str.width(): --> use unicode-width crate
2. str.graphemes() and str.grapheme_indices(): --> use unicode-segmentation crate
3. str.nfd_chars(), str.nfkd_chars(), str.nfc_chars(), str.nfkc_chars(), char.compose(), char.decompose_canonical(), char.decompose_compatible(), char.canonical_combining_class(): --> use unicode-normalization crate

Remove regex module from libunicode
2015-04-12T22:30:10+00:00 Chris Wong lambda.fairy@gmail.com 2015-04-12T01:24:19+00:00 urn:sha1:5308ac939a330b74540bea5920b0086a2d954648
The regex crate keeps its own tables now (rust-lang/regex#41) so we don't need them here. [breaking-change]

use normative source for Grapheme class data
2015-04-06T23:46:48+00:00 kwantam kwantam@gmail.com 2015-04-06T23:42:18+00:00 urn:sha1:bef00ab2b82f75e267a3bf19e511f21e41e41b9a
@mahkoh points out in #15628 that unicode.py does not use normative data for Grapheme classes. This PR fixes that issue.
In addition, GC_RegionalIndicator is renamed GC_Regional_Indicator in order to stay in line with the Unicode class name definitions. I have updated refs in u_str.rs, and verified that there are no refs elsewhere in the codebase. However, in principle someone using the unicode tables for their own purposes might see breakage from this.

unicode: Properly parse ranges in UnicodeData.txt
2015-03-03T19:04:55+00:00 Florian Zeitz florob@babelmonkeys.de 2015-03-03T17:35:41+00:00 urn:sha1:c9e2de42b590c6d294afd1db44334c5168a694bb
This handles the ranges contained in UnicodeData.txt. Counterintuitively this actually makes the tables shorter.
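
For readers unfamiliar with the range notation that the last entry refers to: UnicodeData.txt lists some large blocks as a pair of lines whose names end in ", First>" and ", Last>" instead of one line per code point. The following is a minimal Python sketch of how such pairs can be folded into inclusive ranges. It is not the code from unicode.py; the function name `parse_unicode_data` and the sample lines are assumptions made for illustration only.

```python
def parse_unicode_data(lines):
    """Group code points by general category as inclusive (lo, hi) ranges.

    Hypothetical helper, not taken from unicode.py. It assumes the
    semicolon-separated UnicodeData.txt layout: field 0 is the code point
    in hex, field 1 the name, field 2 the general category.
    """
    categories = {}
    range_start = None  # set while we are between a First and a Last line
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code = int(fields[0], 16)
        name, category = fields[1], fields[2]

        if name.endswith(", First>"):
            # Remember where the range begins; the Last line closes it.
            range_start = code
            continue
        if name.endswith(", Last>"):
            if range_start is None:
                raise ValueError("Last marker without a First marker")
            lo, range_start = range_start, None
        else:
            lo = code

        ranges = categories.setdefault(category, [])
        if ranges and ranges[-1][1] + 1 == lo:
            # Adjacent to the previous range: extend it instead of adding one.
            ranges[-1] = (ranges[-1][0], code)
        else:
            ranges.append((lo, code))
    return categories


# Illustrative sample in the UnicodeData.txt layout (not a full data file).
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
""".splitlines()

tables = parse_unicode_data(SAMPLE)
```

In this sketch a First/Last pair becomes a single (lo, hi) entry, and adjacent single code points are merged into one range as well, so each category's table stays compact.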