| Age | Commit message | Author | Lines |
|
The ASCII subset of Unicode is fixed and will never change, so we don't
need to generate tables for it with every new Unicode version. This
saves a few bytes of static data and speeds up `char::is_control` and
`char::is_grapheme_extended` on ASCII inputs.
Since the table lookup functions exported from the `unicode` module
give nonsensical results on ASCII input (and in fact panic in debug
mode), I had to add some private wrapper methods to `char` which check
for ASCII-ness first.
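A minimal sketch of such a wrapper, assuming a table lookup that is only valid for non-ASCII input. The function names and the one-range "table" are hypothetical stand-ins, not the actual contents of the `unicode` module:

```rust
/// Hypothetical stand-in for a generated table lookup that must not be
/// called with ASCII input.
fn table_lookup_grapheme_extend(c: char) -> bool {
    debug_assert!(!c.is_ascii(), "table lookup called with ASCII input");
    // Illustrative one-range "table": combining diacritical marks.
    ('\u{300}'..='\u{36f}').contains(&c)
}

/// Wrapper that answers for ASCII without touching the table.
fn is_grapheme_extended(c: char) -> bool {
    // No ASCII character has the Grapheme_Extend property, so the
    // short-circuit both skips the lookup and avoids the debug panic.
    !c.is_ascii() && table_lookup_grapheme_extend(c)
}
```

Because `&&` short-circuits, the table function is never reached for ASCII input.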
|
|
|
|
r=joshtriplett,tgross35
unicode-table-generator refactors
Split off from https://github.com/rust-lang/rust/pull/145219
|
|
According to
https://www.unicode.org/policies/stability_policy.html#Property_Value,
the set of codepoints in `Cc` will never change. So we can hard-code
the patterns to match against instead of using a table.
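As a sketch, the hard-coded check could look like the following. The two ranges are the `Cc` code points (U+0000 to U+001F and U+007F to U+009F); the free-function form is an assumption for illustration and not necessarily how the standard library writes it:

```rust
/// Returns `true` if `c` has General_Category=Cc (control).
fn is_control(c: char) -> bool {
    // C0 controls, DEL, and C1 controls.
    matches!(c, '\0'..='\x1f' | '\x7f'..='\u{9f}')
}
```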
|
|
|
|
Rewrite `generate_tests` to be more idiomatic.
|
|
The `merge_ranges` function was very complicated and hard to understand.
Fortunately, we can use `slice::chunk_by` to achieve the same thing.
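A minimal sketch of merging with `slice::chunk_by`, assuming the input is a sorted, deduplicated slice of code points. The signature is hypothetical and may differ from the tool's actual `merge_ranges`:

```rust
use std::ops::RangeInclusive;

/// Merges a sorted, deduplicated slice of code points into inclusive ranges.
fn merge_ranges(codepoints: &[u32]) -> Vec<RangeInclusive<u32>> {
    codepoints
        // Keep extending a chunk while each element is exactly one more
        // than its predecessor.
        .chunk_by(|a, b| *a + 1 == *b)
        // `chunk_by` never yields an empty chunk, so indexing is safe.
        .map(|run| run[0]..=run[run.len() - 1])
        .collect()
}
```

For example, `[1, 2, 3, 7, 9, 10]` becomes `1..=3`, `7..=7`, `9..=10`.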
|
|
Include the sizes of the `to_lowercase` and `to_uppercase` tables in the
total size calculations.
|
|
To make changes in table size obvious from git diffs
|
|
`unicode-table-generator` panicked while populating `distinct_indices`
because of duplicated indices. This was introduced by swapping the
order of `canonical_words.push(...)` and `canonical_words.len()`.
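The commit text does not include the code, so the following is only a hypothetical illustration of why the order matters when an index is derived from `len()`. The `intern` helper and the `u64` element type are assumptions:

```rust
/// Interns `word`, returning its index in `canonical_words`.
fn intern(canonical_words: &mut Vec<u64>, word: u64) -> usize {
    if let Some(idx) = canonical_words.iter().position(|&w| w == word) {
        return idx;
    }
    // Read the length *before* pushing: it is the index the new element
    // will occupy. Reading it after the push gives a value one too large.
    let idx = canonical_words.len();
    canonical_words.push(word);
    idx
}
```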
|
|
make char::is_whitespace unstably const
I am adding this to the existing https://github.com/rust-lang/rust/issues/132241 feature gate, since `is_digit` and `is_whitespace` seem similar enough that one can group them together.
|
|
unicode_data.rs: show command for generating file
https://github.com/rust-lang/rust/pull/131647 made this an easily runnable tool; now we just have to mention that in the comment. :)
Fixes https://github.com/rust-lang/rust/issues/131640.
|
|
|
|
|
|
|
|
Register `src/tools/unicode-table-generator` as a runnable tool
It seems like `src/tools/unicode-table-generator` is not currently managed by bootstrap. This PR wires it up with bootstrap as a runnable tool.
This tool seems to take two possible args:
1. (Mandatory) path to `library/core/src/unicode/unicode_data.rs`, and
2. (Optional) path to generate a test file.
I only passed the mandatory path to `unicode_data.rs` in bootstrap and didn't do anything about (2). I'm not sure about how this tool is supposed to be run.
`Cargo.lock` is modified because I renamed `unicode-table-generator`'s bin name to match the tool name, as bootstrap's tool running logic expects the bin name to be derived from the tool name.
I also added a triagebot message reminding contributors not to edit the library source file manually, but to edit the tool and regenerate instead. This should probably be a tidy check (if that's desirable, it can go in a follow-up PR, though it may be overkill).
Helps with #131640 but does not close it, because there are still no docs.
r? `@Mark-Simulacrum` (since I think you authored this tool?)
|
|
These comments were updated on master but not through this tool, so the
comments in the tool became outdated. Sync the comments to stay
consistent.
|
|
Bootstrap assumes that the binary name is the same as the tool name;
matching them just makes everyone's lives easier.
|
|
|
|
|
|
The previous commit updated `rustfmt.toml` appropriately. This commit is
the outcome of running `x fmt --all` with the new formatting options.
|
|
This adds a dedicated check for the lower bound
(if it is outside of the ASCII range) to the output of the `unicode-table-generator` tool.
This generalizes the ASCII-only fast path, but only for the `Grapheme_Extend` property for now,
as that is the only one with a lower bound outside of ASCII.
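A sketch of what such a generated lower-bound check might look like. The constant name, the two-entry range table, and the binary-search lookup are illustrative assumptions, not the tool's actual output:

```rust
use std::cmp::Ordering;

/// Assumed lowest code point with the property (U+0300 in this sketch).
const LOWER_BOUND: u32 = 0x300;

/// Illustrative excerpt of sorted, non-overlapping inclusive ranges.
const RANGES: &[(u32, u32)] = &[(0x300, 0x36f), (0x483, 0x489)];

/// Slow path: binary search over the range table.
fn lookup_slow(cp: u32) -> bool {
    RANGES
        .binary_search_by(|&(lo, hi)| {
            if hi < cp {
                Ordering::Less
            } else if lo > cp {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}

/// Fast path: anything below the lower bound is rejected without a lookup.
fn grapheme_extend(c: char) -> bool {
    let cp = c as u32;
    cp >= LOWER_BOUND && lookup_slow(cp)
}
```

With a bound above 0x7F, the check rejects all of ASCII and also the non-ASCII code points below the first table entry.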
|
|
|
|
|
|
The indices are encoded as `u32`s in the range of invalid `char`s, so
that we know that if any mapping fails to parse as a `char` we should
use the value for lookup in the multi-table.
This avoids the second binary search in cases where a multi-`char`
mapping is needed.
Idea from @nikic
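A minimal sketch of the encoding, under stated assumptions: `INDEX_BASE`, the table names, and the three sample entries are hypothetical and chosen only so the example is small. Any `u32` above `0x10FFFF` fails `char::from_u32`, which is what marks it as an index:

```rust
/// Hypothetical base for encoded indices; it lies above the `char` range.
const INDEX_BASE: u32 = 0x40_0000;

/// Multi-character mappings, padded with '\0'.
static MULTI: &[[char; 3]] = &[['S', 'S', '\0'], ['F', 'F', 'I']];

/// (input, single-char output or encoded index), sorted by input.
static UPPER: &[(char, u32)] = &[
    ('\u{df}', INDEX_BASE),       // ß -> MULTI[0]
    ('\u{e9}', '\u{c9}' as u32),  // é -> É
    ('\u{fb03}', INDEX_BASE + 1), // ﬃ -> MULTI[1]
];

fn to_upper(c: char) -> [char; 3] {
    match UPPER.binary_search_by_key(&c, |&(key, _)| key) {
        // Not in the table: return the character unchanged.
        Err(_) => [c, '\0', '\0'],
        Ok(i) => {
            let value = UPPER[i].1;
            match char::from_u32(value) {
                // A valid `char` is the single-character mapping itself.
                Some(single) => [single, '\0', '\0'],
                // Otherwise the value is an index into the multi-table,
                // so no second binary search is needed.
                None => MULTI[(value - INDEX_BASE) as usize],
            }
        }
    }
}
```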
|
|
The majority of char case replacements are single char replacements,
so storing them as [char; 3] wastes a lot of space.
This commit splits the replacement tables for both `to_lower` and
`to_upper` into two separate tables, one with single-character mappings
and one with multi-character mappings.
This reduces the binary size for programs using all of these tables
by roughly 24K bytes.
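To see where the saving comes from, here is a rough size comparison. The entry types and helper functions are assumptions for illustration, and the entry counts are left as parameters because the commit does not give the real ones:

```rust
use std::mem::size_of;

/// Entry in the old layout: input plus up to three output chars.
type OldEntry = (char, [char; 3]);
/// Entry in a single-character table: input plus one output char.
type SingleEntry = (char, char);

/// Bytes used when every mapping is stored as an `OldEntry`.
fn old_layout_bytes(total: usize) -> usize {
    total * size_of::<OldEntry>()
}

/// Bytes used when only the multi-character mappings keep the wide entry.
fn split_layout_bytes(single: usize, multi: usize) -> usize {
    single * size_of::<SingleEntry>() + multi * size_of::<OldEntry>()
}
```

Each single-character mapping drops from 16 bytes to 8 in this sketch, so the saving grows with the share of single-character mappings.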
|
|
Since ASCII chars are already handled by a special case in the
`to_lower` and `to_upper` functions, there's no need to waste space on
them in the LUTs.
|
|
|
|
Implements #101400.
|
|
Avoid cloning a collection only to iterate over it
`@rustbot` label: +C-cleanup
|
|
Smaller improvements of tidy and the unicode generator
|
|
|
|
Reduce duplication by moving fetching logic into a dedicated function.
|
|
|
|
|
|
|
|
|