path: root/src/libcore/unicode
Age | Commit message | Author | Lines
2019-09-06 | it's more pythonic to use 'is not None' in python files | Guanqun Lu | -1/+1
2019-09-04 | remove XID and Pattern_White_Space unicode tables from libcore | Aleksey Kladov | -385/+4
They are only used by rustc_lexer, and are not needed elsewhere. So we move the relevant definitions into rustc_lexer (while the actual unicode data comes from the unicode-xid crate) and make the rest of the compiler use it.
2019-08-05 | Make some items in core::unicode private | Matthew Jasper | -20/+20
They were reachable through opaque macros defined in `core`
2019-07-26 | Rollup merge of #62084 - euclio:unicode-table-tweak, r=kennytm | Mazdak Farrokhzad | -2/+2
allow clippy::unreadable_literal in unicode tables

Also modifies the generation script to emit 2018 edition paths.
2019-07-12 | allow clippy::unreadable_literal in unicode tables | Andy Russell | -4/+4
Also modifies the generation script to emit 2018 edition paths.
2019-07-12 | Regenerate character tables for Unicode 12.1 | Josh Stone | -734/+765
2019-07-12 | Update unicode scripts for the current coding style | Josh Stone | -5/+5
2019-07-06 | Rollup merge of #60081 - pawroman:cleanup_unicode_script, r=varkor | Mazdak Farrokhzad | -352/+740
Refactor unicode.py script

Hi,

I noticed that the `unicode.py` script used some deprecated escapes in regular expressions. E.g. `\d`, `\w`, `\.` will be illegal in the future without "raw strings". This is now fixed. I have also cleaned up the script quite a bit.

## Escape deprecation

OK (note the `r`): `re.compile(r"\d")`

Deprecated (from Python 3.6 onwards, see [here][link1] and [here][link2]): `re.compile("\d")`.

[link1]: https://docs.python.org/3.6/whatsnew/3.6.html#deprecated-python-behavior
[link2]: https://bugs.python.org/issue27364

This was evident running the script using Python 3.7 like so:

```
$ python3 -Wall unicode.py
unicode.py:227: DeprecationWarning: invalid escape sequence \w
  re1 = re.compile("^ *([0-9A-F]+) *; *(\w+)")
unicode.py:228: DeprecationWarning: invalid escape sequence \.
  re2 = re.compile("^ *([0-9A-F]+)\.\.([0-9A-F]+) *; *(\w+)")
unicode.py:453: DeprecationWarning: invalid escape sequence \d
  pattern = "for Version (\d+)\.(\d+)\.(\d+) of the Unicode"
```

The documentation states that

> A backslash-character pair that is not a valid escape sequence now generates a DeprecationWarning. Although this will eventually become a SyntaxError, that will not be for several Python releases.

## Testing

To test my changes, I had to add support for choosing the Unicode version to use. The script will default to latest release (which is 12.0.0 at the moment, repo has 11.0.0 checked in).

The script generates the exact same output for version 11.0.0 with Python 2.7 and 3.7 and no longer generates any deprecation warnings:

```
$ python3 -Wall unicode.py -v 11.0.0
Using Unicode version: 11.0.0
Regenerated tables.rs.
$ git diff tables.rs
$ python2 -Wall unicode.py -v 11.0.0
Using Unicode version: 11.0.0
Regenerated tables.rs.
$ git diff tables.rs
$ python2 --version
Python 2.7.16
$ python3 --version
Python 3.7.3
```

## Extra functionality

Furthermore, the script will check and download the latest Unicode version by default (without the `-v` argument).

The `--help` is below:

```
$ ./unicode.py --help
usage: unicode.py [-h] [-v VERSION]

Regenerate Unicode tables (tables.rs).

optional arguments:
  -h, --help            show this help message and exit
  -v VERSION, --version VERSION
                        Unicode version to use (if not specified, defaults to
                        latest available final release).
```

## Cleanups

I have cleaned up the code quite a bit, with Python best practices and code style in mind. I'm happy to provide more details and rationale for all my changes if the reviewers so desire.

One externally visible change is that the Unicode data will now be downloaded into `src/libcore/unicode/downloaded` directory suffixed by Unicode version:

```
$ pwd
.../rust/src/libcore/unicode
$ exa -T downloaded/
downloaded
├── 11.0.0
│  ├── DerivedCoreProperties.txt
│  ├── DerivedNormalizationProps.txt
│  ├── PropList.txt
│  ├── ReadMe.txt
│  ├── Scripts.txt
│  ├── SpecialCasing.txt
│  └── UnicodeData.txt
└── 12.0.0
   ├── DerivedCoreProperties.txt
   ├── DerivedNormalizationProps.txt
   ├── PropList.txt
   ├── ReadMe.txt
   ├── Scripts.txt
   ├── SpecialCasing.txt
   └── UnicodeData.txt
```
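The escape deprecation described in this pull request can be reproduced in isolation. The following is a minimal sketch, not code taken from `unicode.py`: the names `RANGE_RE` and `parse_range` and the sample input line are hypothetical, and the pattern is only modeled on the `re2` line quoted in the warning output above.

```python
import re

# Raw string: the backslashes reach the regex engine unchanged, so Python
# does not emit an "invalid escape sequence" DeprecationWarning here.
# Pattern modeled on the `re2` line from the warning output (assumption).
RANGE_RE = re.compile(r"^ *([0-9A-F]+)\.\.([0-9A-F]+) *; *(\w+)")


def parse_range(line):
    """Return (first, last, property) for a 'XXXX..YYYY ; Name' line, else None."""
    match = RANGE_RE.match(line)
    if match is None:
        return None
    first, last, prop = match.groups()
    # Code points are written in hexadecimal in the data files.
    return int(first, 16), int(last, 16), prop


# Hypothetical input line in the style of a Unicode data file.
print(parse_range("0041..005A    ; Uppercase # sample"))
```

Running a file like this with `python3 -Wall` should show no deprecation warning for the pattern, whereas the same pattern written without the `r` prefix is what triggered the warnings quoted in the pull request.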
2019-07-01 | Address review remarks in unicode.py | Paweł Romanowski | -55/+61
2019-06-10 | Apply suggestions from code review | Paweł Romanowski | -4/+5
Co-Authored-By: varkor <github@varkor.com>
2019-04-19 | Refactor and document unicode.py script | Paweł Romanowski | -302/+518
2019-04-18 | Fix tidy errors | Paweł Romanowski | -2/+3
2019-04-18 | More cleanups for unicode.py | Paweł Romanowski | -25/+23
2019-04-18 | Clean up unicode.py script | Paweł Romanowski | -103/+269
2019-04-18 | libcore => 2018 | Taiki Endo | -5/+5
2018-12-25 | Remove licenses | Mark Rousskov | -90/+1
2018-12-04 | cleanup: remove static lifetimes from consts | ljedrz | -6/+6
2018-11-10 | revert making internal APIs const fn. | Mazdak Farrokhzad | -1/+1
2018-11-10 | constify parts of libcore. | Mazdak Farrokhzad | -2/+1
2018-08-01 | Auto merge of #51609 - dscorbett:is_numeric, r=alexcrichton | bors | -30/+44
Treat gc=No characters as numeric

[`char::is_numeric`](https://doc.rust-lang.org/std/primitive.char.html#method.is_numeric) and [`char::is_alphanumeric`](https://doc.rust-lang.org/std/primitive.char.html#method.is_alphanumeric) are documented to be defined “in terms of the Unicode General Categories 'Nd', 'Nl', 'No'”, but unicode.py does not group 'No' with the other 'N' categories. These functions therefore currently return `false` for characters like ⟨¾⟩ and ⟨①⟩.
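The general categories named in this change can be inspected with Python's standard `unicodedata` module. The following is only a sketch of the classification, not the logic in `unicode.py` or libcore; the helper name `is_numeric_gc` and the constant `NUMERIC_CATEGORIES` are hypothetical.

```python
import unicodedata

# The three numeric general categories named in the char::is_numeric docs.
NUMERIC_CATEGORIES = {"Nd", "Nl", "No"}


def is_numeric_gc(ch):
    """True if the character's general category is Nd, Nl or No."""
    return unicodedata.category(ch) in NUMERIC_CATEGORIES


# A decimal digit, a Roman numeral, the two characters cited in the
# commit message (U+00BE and U+2460), and a letter for contrast.
for ch in ["7", "\u2167", "\u00be", "\u2460", "a"]:
    print(repr(ch), unicodedata.category(ch), is_numeric_gc(ch))
```

Leaving `"No"` out of the set reproduces the behaviour the commit message describes, where ⟨¾⟩ and ⟨①⟩ are not treated as numeric.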
2018-07-06 | Handle array manually in string case conversion methods | Pazzaz | -0/+3
2018-06-17 | Treat gc=No characters as numeric | David Corbett | -30/+44
2018-06-11 | Regenerate character tables for Unicode 11 | Josh Stone | -1120/+1214
2018-05-21 | Fix tables.rs | varkor | -6/+45
2018-05-21 | Avoid counting characters and add explanatory comment to test | varkor | -1/+1
2018-05-21 | Use Grapheme_Extend instead of Mn | varkor | -166/+129
2018-05-21 | Use the correct output directory for downloading Unicode files | varkor | -2/+1
2018-05-21 | Escape combining characters in escape_debug | varkor | -1/+1
2018-05-21 | Keep tables.rs copyright notice up to date | varkor | -5/+5
2018-05-21 | Download unicode data files in directory of unicode.py | varkor | -7/+11
2018-05-21 | Update unicode/tables.rs with Mn | varkor | -1/+121
2018-05-01 | Fix a warning in libcore on 16bit targets. | Vadzim Dambrouski | -8/+8
This code assumes that usize is at least 32 bits, which is not the case on 16-bit targets. It produces a warning that will fail the compilation on MSP430 if deny(warnings) is enabled. It is very unlikely that someone would actually use this code on a microcontroller, but since unicode was merged into libcore we have to compile it on 16-bit targets.
2018-04-12 | Mark the rest of the `unicode` feature flag as perma-unstable. | Simon Sapin | -1/+1
2018-04-12 | Dedicated tracking issue for UnicodeVersion and UNICODE_VERSION. | Simon Sapin | -0/+3
2018-04-12 | Move core::char::printable to core::unicode::printable | Simon Sapin | -0/+786
2018-04-12 | Merge unstable Utf16Encoder into EncodeUtf16 | Simon Sapin | -58/+0
2018-04-12 | Merge core::unicode::str into core::str | Simon Sapin | -188/+58
And the UnicodeStr trait into StrExt
2018-04-12 | Remove the CharExt trait, now that libcore has inherent methods for char | Simon Sapin | -6/+3
2018-04-12 | Move the rest of core::unicode::char to core::unicode | Simon Sapin | -1438/+0
2018-04-12 | Move char decoding iterators into a separate private module. | Simon Sapin | -129/+0
2018-04-12 | Reexport from core::unicode::char in core::char rather than vice versa | Simon Sapin | -23/+4
2018-04-12 | Move contents of libstd_unicode into libcore | Simon Sapin | -0/+4782