path: root/library/core/src/unicode
Age  Commit message  Author  Lines
2025-09-07optimization: Don't include ASCII characters in Unicode tablesKarl Meakin-243/+276
The ASCII subset of Unicode is fixed and will never change, so we don't need to generate tables for it with every new Unicode version. This saves a few bytes of static data and speeds up `char::is_control` and `char::is_grapheme_extended` on ASCII inputs. Since the table lookup functions exported from the `unicode` module will give nonsensical results on ASCII input (and in fact will panic in debug mode), I had to add some private wrapper methods to `char` which check for ASCII-ness first.
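The wrapper pattern described above can be sketched as follows. This is a minimal illustration, not the code from the commit: the names `NON_ASCII_WHITE_SPACE`, `lookup_non_ascii`, and `is_whitespace_sketch` are hypothetical, and the `White_Space` property is used only because its membership is small enough to write out by hand.

```rust
/// Hypothetical stand-in for a generated table: the non-ASCII members of
/// the White_Space property as sorted, inclusive code point ranges.
const NON_ASCII_WHITE_SPACE: &[(u32, u32)] = &[
    (0x0085, 0x0085),
    (0x00A0, 0x00A0),
    (0x1680, 0x1680),
    (0x2000, 0x200A),
    (0x2028, 0x2029),
    (0x202F, 0x202F),
    (0x205F, 0x205F),
    (0x3000, 0x3000),
];

/// Table lookup that is only meaningful for non-ASCII input.
/// In debug builds it panics if it is handed an ASCII character.
fn lookup_non_ascii(c: char) -> bool {
    debug_assert!(!c.is_ascii(), "table does not cover ASCII");
    let cp = c as u32;
    NON_ASCII_WHITE_SPACE
        .iter()
        .any(|&(lo, hi)| lo <= cp && cp <= hi)
}

/// Wrapper that answers ASCII queries with a hard-coded pattern and
/// only consults the table for everything else.
fn is_whitespace_sketch(c: char) -> bool {
    if c.is_ascii() {
        // ASCII White_Space: U+0009..=U+000D and U+0020.
        matches!(c, '\t'..='\r' | ' ')
    } else {
        lookup_non_ascii(c)
    }
}
```

Callers only ever see the wrapper, so the table never receives an input it was not generated for.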
2025-09-05change file-is-generated doc comment to innerMarijn Schouten-1/+1
2025-09-03Rollup merge of #145414 - Kmeakin:km/unicode-table-refactors, r=joshtriplett,tgross35Stuart Cook-4/+16
unicode-table-generator refactors Split off from https://github.com/rust-lang/rust/pull/145219
2025-08-30Auto merge of #145479 - Kmeakin:km/hardcode-char-is-control, r=joboetbors-26/+0
Hard-code `char::is_control` Split off from https://github.com/rust-lang/rust/pull/145219 According to https://www.unicode.org/policies/stability_policy.html#Property_Value, the set of codepoints in `Cc` will never change. So we can hard-code the patterns to match against instead of using a table. This doesn't change the generated assembly, since the lookup table is small enough that [LLVM is able to inline the whole search](https://godbolt.org/z/bG8dM37YG). But this does reduce the chance of regressions if LLVM's heuristics change in the future, and means less generated Rust code checked in to `unicode-data.rs`.
2025-08-16refactor: Hard-code `char::is_control`Karl Meakin-26/+0
According to https://www.unicode.org/policies/stability_policy.html#Property_Value, the set of codepoints in `Cc` will never change. So we can hard-code the patterns to match against instead of using a table.
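As a rough sketch of what hard-coding the patterns can look like: the `Cc` category consists of U+0000..=U+001F and U+007F..=U+009F, so a single `matches!` covers it. The function name `is_control_hardcoded` is hypothetical and the exact form used in the standard library may differ.

```rust
/// Hard-coded membership test for general category Cc (control characters).
/// The two ranges are fixed by the Unicode stability policy, so no
/// generated table is required.
fn is_control_hardcoded(c: char) -> bool {
    matches!(c, '\0'..='\x1f' | '\x7f'..='\u{9f}')
}
```

Comparing the sketch against `char::is_control` over every `char` is a cheap way to confirm the two ranges are complete.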
2025-08-15refactor: Include size of case conversion tablesKarl Meakin-5/+7
Include the sizes of the `to_lowercase` and `to_uppercase` tables in the total size calculations.
2025-08-15refactor: Include table sizes in comment at top of `unicode_data.rs`Karl Meakin-0/+10
To make changes in table size obvious from git diffs
2025-08-13Hide docs for core::unicodeltdk-2/+2
2025-07-10Remove unnecessary parens in closure body with unused lintyukang-1/+1
2025-03-08Remove unneeded parentheses.Markus Reiter-6/+6
2025-03-07Use `intrinsics::assume` instead of `hint::assert_unchecked`.Markus Reiter-2/+8
2025-03-07Never inline `lookup_slow`.Markus Reiter-0/+2
2025-03-06Add second precondition for `skip_search`.Markus Reiter-57/+205
2025-03-06Allow optimizing out `panic_bounds_check` in Unicode checks.Markus Reiter-39/+34
2025-01-20core: add `#![warn(unreachable_pub)]`Urgau-0/+2
2024-12-04Reformat Python code with `ruff`Jakub Beránek-19/+34
2024-11-27update cfgsBoxy-3/+0
2024-11-12stabilize const_unicode_case_lookupRalf Jung-0/+3
2024-11-06Auto merge of #132500 - RalfJung:char-is-whitespace-const, r=jhprattbors-1/+1
make char::is_whitespace unstably const I am adding this to the existing https://github.com/rust-lang/rust/issues/132241 feature gate, since `is_digit` and `is_whitespace` seem similar enough that one can group them together.
2024-11-03Rollup merge of #132499 - RalfJung:unicode_data.rs, r=tgross35Matthias Krüger-1/+1
unicode_data.rs: show command for generating file https://github.com/rust-lang/rust/pull/131647 made this an easily runnable tool, now we just have to mention that in the comment. :) Fixes https://github.com/rust-lang/rust/issues/131640.
2024-11-02make char::is_whitespace unstably constRalf Jung-1/+1
2024-11-02unicode_data.rs: show command for generating fileRalf Jung-1/+1
2024-11-02get rid of a whole bunch of unnecessary rustc_const_unstable attributesRalf Jung-3/+0
2024-10-13switch unicode-data back to 'static'Ralf Jung-8/+8
2024-09-12Rollup merge of #130101 - RalfJung:const-cleanup, r=fee1-deadMatthias Krüger-4/+2
some const cleanup: remove unnecessary attributes, add const-hack indications I learned that we use `FIXME(const-hack)` on top of the "const-hack" label. That seems much better since it marks the right place in the code and moves around with the code. So I went through the PRs with that label and added appropriate FIXMEs in the code. IMO this means we can then remove the label -- Cc @rust-lang/wg-const-eval. I also noticed some const stability attributes that don't do anything useful, and removed them. r? @fee1-dead
2024-09-10Bump unicode printable to version 16.0.0Marcondiro-57/+73
2024-09-10Bump unicode_data to version 16.0.0Marcondiro-651/+670
2024-09-08add FIXME(const-hack)Ralf Jung-4/+2
2024-07-19Use `#[rustfmt::skip]` on some `use` groups to prevent reordering.Nicholas Nethercote-4/+6
`use` declarations will be reformatted in #125443. Very rarely, there is a desire to force a group of `use` declarations together in a way that auto-formatting will break up. E.g. when you want a single comment to apply to a group. #126776 dealt with all of these in the codebase, ensuring that no comments intended for multiple `use` declarations would end up in the wrong place. But some people were unhappy with it. This commit uses `#[rustfmt::skip]` to create these custom `use` groups in an idiomatic way for a few of the cases changed in #126776. This works because rustfmt treats any `use` item annotated with `#[rustfmt::skip]` as a barrier and won't reorder other `use` items around it.
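A small sketch of the idiom, assuming a comment that is meant to cover a whole group of imports. The imported items and the helper `table_sizes` are placeholders chosen for illustration, not code from the commit.

```rust
// Collections backing the lookup caches. This comment is meant to apply
// to every item in the group, so the group is kept together by hand.
#[rustfmt::skip]
use std::collections::{
    HashMap,
    BTreeMap,
};

/// Uses both imports so the example compiles without unused-import warnings.
fn table_sizes() -> (usize, usize) {
    let mut unordered: HashMap<u32, char> = HashMap::new();
    let mut ordered: BTreeMap<u32, char> = BTreeMap::new();
    unordered.insert(0x41, 'A');
    ordered.insert(0x61, 'a');
    ordered.insert(0x62, 'b');
    (unordered.len(), ordered.len())
}
```

Without the attribute, rustfmt is free to reorder the names inside the braces and to move neighbouring `use` items past the comment.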
2024-07-17Avoid comments that describe multiple `use` items.Nicholas Nethercote-13/+13
There are some comments describing multiple subsequent `use` items. When the big `use` reformatting happens some of these `use` items will be reordered, possibly moving them away from the comment. With this additional level of formatting it's not really feasible to have comments of this type. This commit removes them in various ways: - merging separate `use` items when appropriate; - inserting blank lines between the comment and the first `use` item; - outright deletion (for comments that are relatively low-value); - adding a separate "top-level" comment. We also entirely skip formatting for four library files that contain nothing but `pub use` re-exports, where reordering would be painful.
2024-04-20Add a lower bound check to `unicode-table-generator` outputArpad Borsos-0/+4
This adds a dedicated check for the lower bound (if it is outside of ASCII range) to the output of the `unicode-table-generator` tool. This generalizes the ASCII-only fast-path, but only for the `Grapheme_Extend` property for now, as that is the only one with a lower bound outside of ASCII.
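The lower-bound check can be sketched like this. The table `TOY_RANGES`, the constant `LOWER_BOUND`, and both function names are hypothetical; the two ranges are a toy subset, not the real `Grapheme_Extend` data.

```rust
use std::cmp::Ordering;

/// Toy stand-in for a generated table: sorted, inclusive code point ranges.
const TOY_RANGES: &[(u32, u32)] = &[(0x0300, 0x036F), (0x0483, 0x0489)];

/// Smallest code point present in the table (assumed to be emitted
/// by the generator alongside the table itself).
const LOWER_BOUND: u32 = 0x0300;

/// Full lookup: binary search over the ranges.
fn lookup_slow(cp: u32) -> bool {
    TOY_RANGES
        .binary_search_by(|&(lo, hi)| {
            if hi < cp {
                Ordering::Less
            } else if lo > cp {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}

/// Fast path: anything below the lower bound (which includes all of
/// ASCII here) is rejected without touching the table.
fn lookup(c: char) -> bool {
    let cp = c as u32;
    cp >= LOWER_BOUND && lookup_slow(cp)
}
```

Because the bound lies above U+007F, the single comparison subsumes an `is_ascii` check and also rejects Latin-1 and other low code points early.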
2024-03-28Bump Unicode printables to version 15.1, align to unicode_dataMarcondiro-12/+14
2024-02-09Bump Unicode to version 15.1.0, regenerate tablesMarcondiro-6/+6
2023-06-16Apply changes to fix python linting errorsTrevor Gross-1/+1
2023-03-21Use hex literal for INDEX_MASKMartin Gammelsæter-1/+1
2023-03-16Improve case mapping encoding schemeMartin Gammelsæter-1045/+779
The indices are encoded as `u32`s in the range of invalid `char`s, so that we know that if any mapping fails to parse as a `char` we should use the value for lookup in the multi-table. This avoids the second binary search in cases where a multi-`char` mapping is needed. Idea from @nikic
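One way to picture the encoding, as a sketch under stated assumptions: the tag value `INDEX_TAG`, the table names, and the three toy entries are all hypothetical, and characters missing from the toy table simply pass through unchanged.

```rust
/// Hypothetical tag bit. Any value with this bit set is larger than
/// `char::MAX` (0x10FFFF), so `char::from_u32` rejects it.
const INDEX_TAG: u32 = 0x40_0000;

/// Toy single table, sorted by key. The second field is either a valid
/// `char` value or a tagged index into `UPPER_MULTI`.
const UPPER_SINGLE: &[(char, u32)] = &[
    ('ß', INDEX_TAG),     // multi-char mapping, index 0
    ('à', 'À' as u32),    // ordinary single-char mapping
    ('ŉ', INDEX_TAG | 1), // multi-char mapping, index 1
];

/// Toy multi table, padded with '\0'.
const UPPER_MULTI: &[[char; 3]] = &[['S', 'S', '\0'], ['ʼ', 'N', '\0']];

fn to_upper_sketch(c: char) -> [char; 3] {
    match UPPER_SINGLE.binary_search_by(|&(key, _)| key.cmp(&c)) {
        // Not in the toy table: return the input unchanged.
        Err(_) => [c, '\0', '\0'],
        Ok(i) => {
            let encoded = UPPER_SINGLE[i].1;
            match char::from_u32(encoded) {
                // Parses as a `char`: it is the mapping itself.
                Some(upper) => [upper, '\0', '\0'],
                // Does not parse: strip the tag and index the multi table.
                None => UPPER_MULTI[(encoded & (INDEX_TAG - 1)) as usize],
            }
        }
    }
}
```

Only one binary search is performed per lookup; the multi table is reached by direct indexing.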
2023-03-16Split unicode case LUTs in single and multi variantsMartin Gammelsæter-1682/+963
The majority of char case replacements are single char replacements, so storing them as [char; 3] wastes a lot of space. This commit splits the replacement tables for both `to_lower` and `to_upper` into two separate tables, one with single-character mappings and one with multi-character mappings. This reduces the binary size for programs using all of these tables by roughly 24K bytes.
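To make the per-entry cost concrete, the following sketch compares the size of a table entry that always reserves three `char`s with one that stores a single `char`. The helper name `entry_sizes` is hypothetical and the tuple shapes are an assumption about how such entries might be laid out.

```rust
use std::mem::size_of;

/// Returns (bytes per entry with a `[char; 3]` payload,
///          bytes per entry with a single `char` payload).
fn entry_sizes() -> (usize, usize) {
    (
        size_of::<(char, [char; 3])>(), // key + three 4-byte chars
        size_of::<(char, char)>(),      // key + one 4-byte char
    )
}
```

Each single-character mapping moved out of the wide layout therefore saves half of its entry.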
2023-03-15Skip serializing ascii chars in case LUTsMartin Gammelsæter-26/+0
Since ascii chars are already handled by a special case in the `to_lower` and `to_upper` functions, there's no need to waste space on them in the LUTs.
2022-12-30Replace libstd, libcore, liballoc in line comments.jonathanCogan-1/+1
2022-09-14Bump Unicode to version 15.0.0, regenerate tablesThom Chiovoloni-173/+190
2022-09-04Address feedback from PR #101401Sage Mitchell-8/+12
2022-09-04Make `char::is_lowercase` and `char::is_uppercase` constSage Mitchell-15/+18
Implements #101400.
2022-07-20add #inlineBruce A. MacNaughton-0/+1
2022-07-19generated codeBruce A. MacNaughton-10/+17
2022-05-31Add unicode fast path to `is_printable`Nilstrieb-4/+18
Before, it would enter the full expensive check even for normal ascii characters. Now, it skips the check for the ascii characters in `32..127`. This range was checked manually from the current behavior.
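A sketch of the fast path, assuming a placeholder for the expensive check: `slow_check` below is a crude stand-in (it only rejects control characters) and is not the real printable-table logic; both function names are hypothetical.

```rust
/// Crude placeholder for the full table-driven check. The real check is
/// far more involved; this only rejects control characters.
fn slow_check(c: char) -> bool {
    !c.is_control()
}

/// Printable test with an early exit for the ASCII range 32..127
/// (space through '~'), which never needs the expensive check.
fn is_printable_sketch(c: char) -> bool {
    let x = c as u32;
    if (32..127).contains(&x) {
        return true;
    }
    slow_check(c)
}
```

Note that 127 (DEL) is excluded from the range, so it still goes through the slow path and is rejected there.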
2021-10-06Regenerate tables for Unicode 14.0.0Josh Stone-553/+653
2021-06-23Use HTTPS links where possibleSmitty-2/+2
2021-02-26Add a check for ASCII characters in to_upper and to_lowerMiccah Castorina-6/+14
This extra check has better performance. See discussion here: https://internals.rust-lang.org/t/to-upper-speed/13896
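The shape of such a check can be sketched as below. `TOY_UPPER` is a hypothetical three-entry table and `to_upper_ascii_first` is a hypothetical name; non-ASCII characters missing from the toy table are returned unchanged.

```rust
/// Toy non-ASCII uppercase table, sorted by key.
const TOY_UPPER: &[(char, char)] = &[('à', 'À'), ('é', 'É'), ('ö', 'Ö')];

/// Uppercase conversion that handles ASCII arithmetically and only
/// falls back to a binary search for non-ASCII input.
fn to_upper_ascii_first(c: char) -> char {
    if c.is_ascii() {
        // No table access needed for the most common inputs.
        c.to_ascii_uppercase()
    } else {
        match TOY_UPPER.binary_search_by(|&(key, _)| key.cmp(&c)) {
            Ok(i) => TOY_UPPER[i].1,
            Err(_) => c,
        }
    }
}
```

Once ASCII is handled before the search, the ASCII rows of the table are dead weight, which is what makes it possible to leave them out of the lookup tables entirely.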
2020-12-07Privatize some of libcore unicode_internalsAleksey Kladov-13/+10
My understanding is that these APIs are perma-unstable, so it doesn't make sense to pollute docs & IDE completion[1] with them. [1]: https://github.com/rust-analyzer/rust-analyzer/issues/6738
2020-07-27mv std libs to library/mark-0/+3103