path: root/src/tools/unicode-table-generator
Age  Commit message  Author  Lines
2025-09-07optimization: Don't include ASCII characters in Unicode tablesKarl Meakin-0/+5
The ASCII subset of Unicode is fixed and will never change, so we don't need to generate tables for it with every new Unicode version. This saves a few bytes of static data and speeds up `char::is_control` and `char::is_grapheme_extended` on ASCII inputs. Since the table lookup functions exported from the `unicode` module will give nonsensical errors on ASCII input (and in fact will panic in debug mode), I had to add some private wrapper methods to `char` which check for ASCII-ness first.
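The wrapper idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the function names and the toy one-range table (U+0300..=U+036F, the Combining Diacritical Marks block) are assumptions made for the example.

```rust
/// Hypothetical stand-in for a generated table lookup that only covers
/// non-ASCII code points. The real tables are far larger.
fn grapheme_extend_table_lookup(c: char) -> bool {
    // Mirrors the described behaviour: calling the raw lookup with ASCII
    // input is a bug and is caught in debug builds.
    debug_assert!(!c.is_ascii(), "table lookup called with ASCII input");
    // Toy table: a single inclusive range of combining marks.
    const RANGES: &[(u32, u32)] = &[(0x0300, 0x036F)];
    let cp = c as u32;
    RANGES.iter().any(|&(lo, hi)| lo <= cp && cp <= hi)
}

/// Hypothetical private wrapper: answer ASCII input directly and only
/// consult the table for everything else.
fn is_grapheme_extended_sketch(c: char) -> bool {
    if c.is_ascii() {
        // No ASCII character has the Grapheme_Extend property.
        false
    } else {
        grapheme_extend_table_lookup(c)
    }
}
```

With this shape, `is_grapheme_extended_sketch('a')` never touches the table, while `is_grapheme_extended_sketch('\u{0301}')` falls through to the lookup.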
2025-09-05change file-is-generated doc comment to innerMarijn Schouten-1/+1
2025-09-03Rollup merge of #145414 - Kmeakin:km/unicode-table-refactors, r=joshtriplett,tgross35Stuart Cook-149/+143
unicode-table-generator refactors
Split off from https://github.com/rust-lang/rust/pull/145219
2025-08-16refactor: Hard-code `char::is_control`Karl Meakin-1/+0
According to https://www.unicode.org/policies/stability_policy.html#Property_Value, the set of codepoints in `Cc` will never change. So we can hard-code the patterns to match against instead of using a table.
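For illustration, the `Cc` general category consists of U+0000..=U+001F and U+007F..=U+009F, so a hard-coded check can be a single pattern match. The function name below is an assumption for the example; the exact form used in the standard library may differ.

```rust
/// Sketch of a table-free control-character check: matches exactly the
/// code points in general category Cc.
fn is_control_sketch(c: char) -> bool {
    matches!(c, '\0'..='\x1f' | '\x7f'..='\u{9f}')
}
```

Because the set is covered by the stability policy, a pattern like this does not need to be regenerated for new Unicode versions.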
2025-08-15refactor: Add tests for case conversionsKarl Meakin-11/+41
2025-08-15refactor: `generate_tests`Karl Meakin-52/+45
Rewrite `generate_tests` to be more idiomatic.
2025-08-15refactor: rewrite `ranges_from_set`Karl Meakin-66/+17
The `merge_ranges` function was very complicated and hard to understand. Fortunately, we can use `slice::chunk_by` to achieve the same thing.
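A minimal sketch of the `slice::chunk_by` approach, assuming the input is a sorted, deduplicated slice of code points. The function name and signature are illustrative and are not taken from the tool.

```rust
use std::ops::Range;

/// Collapse a sorted, deduplicated list of code points into half-open
/// ranges. Each chunk is a maximal run of consecutive values.
fn ranges_from_sorted(codepoints: &[u32]) -> Vec<Range<u32>> {
    codepoints
        // Keep two neighbours in the same chunk while they are consecutive.
        .chunk_by(|a, b| a + 1 == *b)
        // A chunk is never empty, so indexing first and last is safe.
        .map(|chunk| chunk[0]..chunk[chunk.len() - 1] + 1)
        .collect()
}
```

For example, `ranges_from_sorted(&[1, 2, 3, 7, 9, 10])` yields `[1..4, 7..8, 9..11]`.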
2025-08-15refactor: Include size of case conversion tablesKarl Meakin-13/+35
Include the sizes of the `to_lowercase` and `to_uppercase` tables in the total size calculations.
2025-08-15refactor: Include table sizes in comment at top of `unicode_data.rs`Karl Meakin-11/+9
This makes changes in table size obvious from git diffs.
2025-08-05fix(unicode-table-generator): fix duplicated unique indicesMarco Cavenati-1/+1
unicode-table-generator panicked while populating distinct_indices because of duplicated indices. This was introduced by swapping the order of canonical_words.push(...) and canonical_words.len().
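The tool's code is not shown in this log, so the following is only a generic sketch of why the relative order of `push` and `len` matters when handing out indices. The helper name `intern` and the `u64` element type are assumptions for the example.

```rust
/// Return the index of `word` in `words`, appending it if it is new.
fn intern(words: &mut Vec<u64>, word: u64) -> usize {
    if let Some(pos) = words.iter().position(|&w| w == word) {
        return pos;
    }
    // Read the length BEFORE pushing: that is the slot the new element
    // will occupy. Reading it after the push would be off by one.
    let index = words.len();
    words.push(word);
    index
}
```

Reading `len()` after the `push` would return an index one past the element just stored, so later lookups through that index would refer to the wrong entry.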
2025-07-18unicode-table-gen: more clippy fixesMarijn Schouten-8/+8
2025-07-18unicode-table-gen: edition 2024Marijn Schouten-2/+2
2025-07-18unicode-table-gen: clippy fixesMarijn Schouten-35/+28
2025-07-10Remove unnecessary parens in closure body with unused lintyukang-1/+1
2025-03-08Remove unneeded parentheses.Markus Reiter-1/+1
2025-03-08Fix formatting.Markus Reiter-34/+7
2025-03-07Use `intrinsics::assume` instead of `hint::assert_unchecked`.Markus Reiter-2/+8
2025-03-07Never inline `lookup_slow`.Markus Reiter-0/+2
2025-03-06Add second precondition for `skip_search`.Markus Reiter-28/+89
2025-03-06Allow optimizing out `panic_bounds_check` in Unicode checks.Markus Reiter-14/+31
2025-02-08Rustfmtbjorn3-21/+27
2024-11-27update cfgsBoxy-5/+0
2024-11-12stabilize const_unicode_case_lookupRalf Jung-0/+5
2024-11-06Auto merge of #132500 - RalfJung:char-is-whitespace-const, r=jhprattbors-1/+1
make char::is_whitespace unstably const
I am adding this to the existing https://github.com/rust-lang/rust/issues/132241 feature gate, since `is_digit` and `is_whitespace` seem similar enough that one can group them together.
2024-11-03Rollup merge of #132499 - RalfJung:unicode_data.rs, r=tgross35Matthias Krüger-1/+1
unicode_data.rs: show command for generating file
https://github.com/rust-lang/rust/pull/131647 made this an easily runnable tool, now we just have to mention that in the comment. :)
Fixes https://github.com/rust-lang/rust/issues/131640.
2024-11-02make char::is_whitespace unstably constRalf Jung-1/+1
2024-11-02unicode_data.rs: show command for generating fileRalf Jung-1/+1
2024-11-02get rid of a whole bunch of unnecessary rustc_const_unstable attributesRalf Jung-6/+0
2024-10-20Rollup merge of #131647 - jieyouxu:unicode-table-generator, r=Mark-SimulacrumMatthias Krüger-5/+3
Register `src/tools/unicode-table-generator` as a runnable tool
It seems like `src/tools/unicode-table-generator` is not currently managed by bootstrap. This PR wires it up with bootstrap as a runnable tool. This tool seems to take two possible args:
1. (Mandatory) path to `library/core/src/unicode/unicode_data.rs`, and
2. (Optional) path to generate a test file.
I only passed the mandatory path to `unicode_data.rs` in bootstrap and didn't do anything about (2). I'm not sure about how this tool is supposed to be run.
`Cargo.lock` is modified because I renamed `unicode-table-generator`'s bin name to match the tool name, as bootstrap's tool running logic expects the bin name to be derived from the tool name.
I also added a triagebot message as a reminder not to edit the library source file manually, but to edit the tool and then regenerate instead. This should probably be a tidy check (if that's desirable then that can be in a follow-up PR, though it may be overkill).
Helps with #131640 but does not close it because there are still no docs.
r? `@Mark-Simulacrum` (since I think you authored this tool?)
2024-10-13unicode-table-generator: sync comments许杰友 Jieyou Xu (Joe)-4/+2
These comments were updated on master but not through this tool, so the comments in the tool became outdated. Sync the comments to stay consistent.
2024-10-13unicode-table-generator: match bin name with tool name许杰友 Jieyou Xu (Joe)-1/+1
Bootstrap assumes that the binary name is the same as tool name, just makes everyone's lives easier.
2024-10-13switch unicode-data back to 'static'Ralf Jung-4/+4
2024-09-22Reformat using the new identifier sorting from rustfmtMichael Goulet-29/+23
2024-07-29Reformat `use` declarations.Nicholas Nethercote-11/+15
The previous commit updated `rustfmt.toml` appropriately. This commit is the outcome of running `x fmt --all` with the new formatting options.
2024-04-20Add a lower bound check to `unicode-table-generator` outputArpad Borsos-3/+27
This adds a dedicated check for the lower bound (if it is outside of ASCII range) to the output of the `unicode-table-generator` tool. This generalizes the ASCII-only fast-path, but only for the `Grapheme_Extend` property for now, as that is the only one with a lower bound outside of ASCII.
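A rough sketch of what a lower-bound fast path can look like. The constant, the two-range toy table, and the function names are assumptions for the example, not the generated code.

```rust
use std::cmp::Ordering;

/// Toy subset of a property table: sorted, non-overlapping inclusive ranges.
const RANGES: &[(u32, u32)] = &[(0x0300, 0x036F), (0x0483, 0x0489)];

/// Smallest code point covered by the toy table above.
const LOWER_BOUND: u32 = 0x0300;

/// Slow path: binary search over the sorted ranges.
#[inline(never)]
fn lookup_slow_sketch(cp: u32) -> bool {
    RANGES
        .binary_search_by(|&(lo, hi)| {
            if hi < cp {
                Ordering::Less
            } else if lo > cp {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}

/// Fast path: everything below the table's first entry is rejected with a
/// single comparison, which covers all of ASCII and more.
fn lookup_sketch(c: char) -> bool {
    let cp = c as u32;
    cp >= LOWER_BOUND && lookup_slow_sketch(cp)
}
```

Compared with a plain ASCII check, the comparison against the table's actual minimum also rejects non-ASCII characters such as `é` without a search.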
2023-04-12remove some unneeded importsKaDiWa-2/+0
2023-03-21Use hex literal for INDEX_MASKMartin Gammelsæter-1/+1
2023-03-16Improve case mapping encoding schemeMartin Gammelsæter-49/+54
The indices are encoded as `u32`s in the range of invalid `char`s, so that we know that if any mapping fails to parse as a `char` we should use the value for lookup in the multi-table. This avoids the second binary search in cases where a multi-`char` mapping is needed. Idea from @nikic
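The encoding can be illustrated with a small sketch. The tag value `INDEX_TAG`, the table names, and the three-entry tables are assumptions for the example. The only property relied on is that a tagged value lies above `0x10FFFF`, so `char::from_u32` rejects it.

```rust
/// Hypothetical tag bit. Any value with this bit set is above 0x10FFFF,
/// so it can never be a valid `char`.
const INDEX_TAG: u32 = 0x40_0000;

/// Multi-character mappings, padded with '\0'.
const UPPER_MULTI: &[[char; 3]] = &[
    ['S', 'S', '\0'],       // ß
    ['\u{2bc}', 'N', '\0'], // ŉ
];

/// Main table, sorted by key. The value is either a `char` stored as `u32`
/// or a tagged index into `UPPER_MULTI`.
const UPPER: &[(char, u32)] = &[
    ('ß', INDEX_TAG | 0),
    ('à', 'À' as u32),
    ('ŉ', INDEX_TAG | 1),
];

fn to_upper_sketch(c: char) -> [char; 3] {
    match UPPER.binary_search_by_key(&c, |&(key, _)| key) {
        // Not in the table: the character maps to itself.
        Err(_) => [c, '\0', '\0'],
        Ok(i) => {
            let raw = UPPER[i].1;
            match char::from_u32(raw) {
                // A valid `char` is a single-character mapping.
                Some(single) => [single, '\0', '\0'],
                // Otherwise strip the tag and index the multi table directly,
                // with no second binary search.
                None => UPPER_MULTI[(raw & (INDEX_TAG - 1)) as usize],
            }
        }
    }
}
```

Only one binary search is performed per lookup: the failed `char::from_u32` conversion doubles as the signal that the value is an index.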
2023-03-16Split unicode case LUTs in single and multi variantsMartin Gammelsæter-13/+45
The majority of char case replacements are single char replacements, so storing them as [char; 3] wastes a lot of space. This commit splits the replacement tables for both `to_lower` and `to_upper` into two separate tables, one with single-character mappings and one with multi-character mappings. This reduces the binary size for programs using all of these tables with roughly 24K bytes.
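The space argument can be made concrete with a back-of-the-envelope sketch. The entry types and the function below are assumptions for illustration, and the counts passed in are hypothetical rather than the real table sizes.

```rust
use std::mem::size_of;

/// Entry in a combined table: every mapping reserves three output slots.
type CombinedEntry = (char, [char; 3]);
/// Entry in a single-character table.
type SingleEntry = (char, char);

/// Returns (bytes with one combined table, bytes with split tables).
fn table_bytes(single_entries: usize, multi_entries: usize) -> (usize, usize) {
    let combined = (single_entries + multi_entries) * size_of::<CombinedEntry>();
    let split = single_entries * size_of::<SingleEntry>()
        + multi_entries * size_of::<CombinedEntry>();
    (combined, split)
}
```

Since single-character mappings dominate, nearly every entry shrinks once the two kinds are stored separately.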
2023-03-15Skip serializing ascii chars in case LUTsMartin Gammelsæter-14/+11
Since ascii chars are already handled by a special case in the `to_lower` and `to_upper` functions, there's no need to waste space on them in the LUTs.
2022-09-04Address feedback from PR #101401Sage Mitchell-4/+8
2022-09-04Make `char::is_lowercase` and `char::is_uppercase` constSage Mitchell-10/+16
Implements #101400.
2022-08-28Auto merge of #100497 - kadiwa4:remove_clone_into_iter, r=cjgillotbors-6/+2
Avoid cloning a collection only to iterate over it
`@rustbot` label: +C-cleanup
2022-08-27Rollup merge of #100924 - est31:closure_to_fn_ptr, r=Mark-SimulacrumYuki Okushi-17/+16
Smaller improvements of tidy and the unicode generator
2022-08-23Change hint to correct pathest31-1/+1
2022-08-23Simplify unicode_downloads.rsest31-16/+15
Reduce duplication by moving fetching logic into a dedicated function.
2022-08-13avoid cloning and then iteratingKaDiWa-6/+2
2022-07-20add #inlineBruce A. MacNaughton-0/+1
2022-07-19formattedBruce A. MacNaughton-34/+20
2022-07-19working updatesBruce A. MacNaughton-2/+108