| Age | Commit message (Collapse) | Author | Lines |
|
|
|
They are only used by rustc_lexer, and are not needed elsewhere.
So we move the relevant definitions into rustc_lexer (while the actual
unicode data comes from the unicode-xid crate) and make the rest of
the compiler use it.
|
|
They were reachable through opaque macros defined in `core`
|
|
allow clippy::unreadable_literal in unicode tables
Also modifies the generation script to emit 2018 edition paths.
|
|
Also modifies the generation script to emit 2018 edition paths.
|
|
|
|
|
|
Refactor unicode.py script
Hi, I noticed that the `unicode.py` script used some deprecated escapes in regular expressions. E.g. `\d`, `\w`, `\.` will be illegal in the future without "raw strings". This is now fixed. I have also cleaned up the script quite a bit.
## Escape deprecation
OK (note the `r`):
`re.compile(r"\d")`
Deprecated (from Python 3.6 onwards, see [here][link1] and [here][link2]):
`re.compile("\d")`.
[link1]: https://docs.python.org/3.6/whatsnew/3.6.html#deprecated-python-behavior
[link2]: https://bugs.python.org/issue27364
This was evident running the script using Python 3.7 like so:
```
$ python3 -Wall unicode.py
unicode.py:227: DeprecationWarning: invalid escape sequence \w
re1 = re.compile("^ *([0-9A-F]+) *; *(\w+)")
unicode.py:228: DeprecationWarning: invalid escape sequence \.
re2 = re.compile("^ *([0-9A-F]+)\.\.([0-9A-F]+) *; *(\w+)")
unicode.py:453: DeprecationWarning: invalid escape sequence \d
pattern = "for Version (\d+)\.(\d+)\.(\d+) of the Unicode"
```
The documentation states that
> A backslash-character pair that is not a valid escape sequence now generates a DeprecationWarning. Although this will eventually become a SyntaxError, that will not be for several Python releases.
## Testing
To test my changes, I had to add support for choosing the Unicode version to use. The script will default to latest release (which is 12.0.0 at the moment, repo has 11.0.0 checked in).
The script generates the exact same output for version 11.0.0 with Python 2.7 and 3.7 and no longer generates any deprecation warnings:
```
$ python3 -Wall unicode.py -v 11.0.0
Using Unicode version: 11.0.0
Regenerated tables.rs.
$ git diff tables.rs
$ python2 -Wall unicode.py -v 11.0.0
Using Unicode version: 11.0.0
Regenerated tables.rs.
$ git diff tables.rs
$ python2 --version
Python 2.7.16
$ python3 --version
Python 3.7.3
```
## Extra functionality
Furthermore, the script will check and download the latest Unicode version by default (without the `-v` argument). The `--help` is below:
```
$ ./unicode.py --help
usage: unicode.py [-h] [-v VERSION]
Regenerate Unicode tables (tables.rs).
optional arguments:
-h, --help show this help message and exit
-v VERSION, --version VERSION
Unicode version to use (if not specified, defaults to
latest available final release).
```
## Cleanups
I have cleaned up the code quite a bit, with Python best practices and code style in mind. I'm happy to provide more details and rationale for all my changes if the reviewers so desire.
One externally visible change is that the Unicode data will now be downloaded into `src/libcore/unicode/downloaded` directory suffixed by Unicode version:
```
$ pwd
.../rust/src/libcore/unicode
$ exa -T downloaded/
downloaded
├── 11.0.0
│ ├── DerivedCoreProperties.txt
│ ├── DerivedNormalizationProps.txt
│ ├── PropList.txt
│ ├── ReadMe.txt
│ ├── Scripts.txt
│ ├── SpecialCasing.txt
│ └── UnicodeData.txt
└── 12.0.0
├── DerivedCoreProperties.txt
├── DerivedNormalizationProps.txt
├── PropList.txt
├── ReadMe.txt
├── Scripts.txt
├── SpecialCasing.txt
└── UnicodeData.txt
```
|
|
|
|
Co-Authored-By: varkor <github@varkor.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Treat gc=No characters as numeric
[`char::is_numeric`](https://doc.rust-lang.org/std/primitive.char.html#method.is_numeric) and [`char::is_alphanumeric`](https://doc.rust-lang.org/std/primitive.char.html#method.is_alphanumeric) are documented to be defined “in terms of the Unicode General Categories 'Nd', 'Nl', 'No'”, but unicode.py does not group 'No' with the other 'N' categories. These functions therefore currently return `false` for characters like ⟨¾⟩ and ⟨①⟩.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
This code is assuming that usize >= 32bits, but it is not the case on
16bit targets. It is producing a warning that will fail the compilation
on MSP430 if deny(warnings) is enabled.
It is very unlikely that someone would actually use this code on
a microcontroller, but since unicode was merged into libcore we
have compile it on 16bit targets.
|
|
|
|
|
|
|
|
|
|
And the UnicodeStr trait into StrExt
|
|
|
|
|
|
|
|
|
|
|