rust - https://github.com/rust-lang/rust

Age	Commit message	Author	Lines
2020-07-27	mv std libs to library/	mark	-3103/+0

2020-04-23	Stabilize UNICODE_VERSION (feature unicode_version)	Pyfisch	-1/+6
	The feature will become stable in Rust 1.45. Noted that the value of UNICODE_VERSION is expected to change.
2020-04-11	Store UNICODE_VERSION as a tuple	Pyfisch	-28/+5
	Remove the UnicodeVersion struct containing major, minor and update fields and replace it with a 3-tuple containing the version number. As the value of each field is limited to 255, use u8 to store them.
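As an illustration of the tuple form described above, here is a minimal sketch that reads the constant and formats it. It assumes the constant is reachable as `std::char::UNICODE_VERSION` with type `(u8, u8, u8)`; the helper name `unicode_version_string` is hypothetical and not part of the standard library.

```rust
use std::char::UNICODE_VERSION;

/// Hypothetical helper: formats the (major, minor, update) tuple as "M.m.u".
fn unicode_version_string() -> String {
    // Destructure the 3-tuple; each field is a u8.
    let (major, minor, update) = UNICODE_VERSION;
    format!("{}.{}.{}", major, minor, update)
}
```

Because the value is expected to change between releases, callers should compare against it rather than hard-code a particular version.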
2020-03-27	Remove separate encoding for a single nonzero-mapping byte	Mark Rousskov	-15/+7
	In practice, for the two data sets that still use the bitset encoding (uppercase and lowercase) this is not a significant win, so just drop it entirely. It costs us about 5 bytes, and the complexity is nontrivial.
2020-03-27	Add skip list based implementation for smaller encoding	Mark Rousskov	-777/+244
	This arranges for the sparser sets (everything except lower and uppercase) to be encoded in a significantly smaller context. However, it is also a performance trade-off (roughly 3x slower than the bitset encoding). The 40% size reduction is deemed to be sufficiently important to merit this performance loss, particularly as it is unlikely that this code is hot anywhere (and if it is, paying the memory cost for a bitset that directly represents the data seems worthwhile).
	Alphabetic     : 1599 bytes (- 937 bytes)
	Case_Ignorable : 949 bytes (- 822 bytes)
	Cased          : 359 bytes (- 429 bytes)
	Cc             : 9 bytes (- 15 bytes)
	Grapheme_Extend: 813 bytes (- 675 bytes)
	Lowercase      : 863 bytes
	N              : 419 bytes (- 619 bytes)
	Uppercase      : 776 bytes
	White_Space    : 37 bytes (- 46 bytes)
	Total table sizes: 5824 bytes (-3543 bytes)
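The size versus speed trade-off can be illustrated with a much simpler run-length scheme. This is not the skip list encoding from the commit; it is a sketch under the assumption that a sparse set is stored as alternating lengths of non-member and member runs, and the names `encode_runs` and `contains` are hypothetical.

```rust
/// Encode sorted, non-overlapping inclusive ranges as alternating run lengths:
/// non-member run, member run, non-member run, and so on.
fn encode_runs(ranges: &[(u32, u32)]) -> Vec<u32> {
    let mut runs = Vec::new();
    let mut next = 0u32; // first code point not yet covered by a run
    for &(lo, hi) in ranges {
        runs.push(lo - next); // run of non-members before the range
        runs.push(hi - lo + 1); // run of members in the range
        next = hi + 1;
    }
    runs
}

/// Linear scan over the runs; odd-indexed runs hold members.
fn contains(runs: &[u32], c: u32) -> bool {
    let mut end = 0u32; // exclusive end of the current run
    for (i, &len) in runs.iter().enumerate() {
        end += len;
        if c < end {
            return i % 2 == 1;
        }
    }
    false // past the last run: not a member
}
```

A lookup walks the runs instead of indexing a bit directly, which is the kind of cost the commit message accepts in exchange for smaller tables.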
2020-03-21	Avoid relying on const parameters to function	Mark Rousskov	-4/+4
	LLVM seems to at least sometimes optimize better when the length comes directly from the `len()` of the array vs. an equivalent integer. Also, this allows easier copy/pasting of the function into compiler explorer for experimentation.
2020-03-21	Arrange for zero to be canonical	Mark Rousskov	-252/+227
	We find that it is common for large ranges of chars to be false -- and that means that it is plausibly common for us to ask about a word that is entirely empty. Therefore, we should make sure that we do not need to rotate bits or otherwise perform some operation to map to the zero word; canonicalize it first if possible.
2020-03-21	Push the byte of LAST_CHUNK_MAP into the array	Mark Rousskov	-35/+35
	This optimizes slightly better.
	Alphabetic     : 2536 bytes
	Case_Ignorable : 1771 bytes
	Cased          : 788 bytes
	Cc             : 24 bytes
	Grapheme_Extend: 1488 bytes
	Lowercase      : 863 bytes
	N              : 1038 bytes
	Uppercase      : 776 bytes
	White_Space    : 83 bytes
	Total table sizes: 9367 bytes (-18 bytes; 2 bytes per set)
2020-03-21	Deduplicate test and primary range_search definitions	Mark Rousskov	-46/+50
	This ensures that what we test is what we get for final results as well.
2020-03-21	Add a right shift mapping	Mark Rousskov	-825/+782
	This saves less bytes - by far - and is likely not the best operator to choose. But for now, it works -- a better choice may arise later.
	Alphabetic     : 2538 bytes (- 84 bytes)
	Case_Ignorable : 1773 bytes (- 30 bytes)
	Cased          : 790 bytes (- 18 bytes)
	Cc             : 26 bytes (- 6 bytes)
	Grapheme_Extend: 1490 bytes (- 18 bytes)
	Lowercase      : 865 bytes (- 36 bytes)
	N              : 1040 bytes (- 24 bytes)
	Uppercase      : 778 bytes (- 60 bytes)
	White_Space    : 85 bytes (- 6 bytes)
	Total table sizes: 9385 bytes (-282 bytes)
2020-03-21	Shrink bitset words through functional mapping	Mark Rousskov	-413/+962
	Previously, all words in the (deduplicated) bitset would be stored raw -- a full 64 bits (8 bytes). Now, those words that are equivalent to others through a specific mapping are stored separately and "mapped" to the original when loading; this shrinks the table sizes significantly, as each mapped word is stored in 2 bytes (a 4x decrease from the previous).
	The new encoding is also potentially non-optimal: the "mapped" byte is frequently repeated, as in practice many mapped words use the same base word.
	Currently we only support two forms of mapping: rotation and inversion. Note that these are both guaranteed to map transitively if at all, and supporting mappings for which this is not true may require a more interesting algorithm for choosing the optimal pairing.
	Updated sizes:
	Alphabetic     : 2622 bytes (- 414 bytes)
	Case_Ignorable : 1803 bytes (- 330 bytes)
	Cased          : 808 bytes (- 126 bytes)
	Cc             : 32 bytes
	Grapheme_Extend: 1508 bytes (- 252 bytes)
	Lowercase      : 901 bytes (- 84 bytes)
	N              : 1064 bytes (- 156 bytes)
	Uppercase      : 838 bytes (- 96 bytes)
	White_Space    : 91 bytes (- 6 bytes)
	Total table sizes: 9667 bytes (-1,464 bytes)
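The two mappings named in this commit message, rotation and inversion, can be sketched as follows. This is an illustrative reconstruction, not the generator's code: the `Mapping` enum and the `find_mapping` and `apply` functions are hypothetical names, and the real tables pack the mapping into bytes rather than an enum.

```rust
/// Hypothetical description of how a stored base word maps to a target word.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Mapping {
    /// Rotate the base word left by this many bits.
    Rotate(u32),
    /// Flip every bit of the base word.
    Invert,
}

/// Find a mapping that turns `base` into `target`, if one exists.
fn find_mapping(base: u64, target: u64) -> Option<Mapping> {
    if !base == target {
        return Some(Mapping::Invert);
    }
    // Try every non-trivial rotation of a 64-bit word.
    (1..64)
        .find(|&r| base.rotate_left(r) == target)
        .map(Mapping::Rotate)
}

/// Reconstruct the target word from the base word when loading.
fn apply(base: u64, mapping: Mapping) -> u64 {
    match mapping {
        Mapping::Rotate(r) => base.rotate_left(r),
        Mapping::Invert => !base,
    }
}
```

A word for which `find_mapping` returns `Some` against an already stored base word does not need its own 8-byte entry; only the base index and the mapping have to be recorded.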
2020-03-20	Pre-pop zero chunks before mapping LAST_CHUNK_MAP	Mark Rousskov	-88/+72
	This avoids wasting a small amount of space for some of the data sets. The chunk resizing is caused by but not directly related to changes in this commit.
	Alphabetic     : 3036 bytes
	Case_Ignorable : 2133 bytes (- 3 bytes)
	Cased          : 934 bytes
	Cc             : 32 bytes
	Grapheme_Extend: 1760 bytes (-14 bytes)
	Lowercase      : 985 bytes
	N              : 1220 bytes (- 5 bytes)
	Uppercase      : 934 bytes
	White_Space    : 97 bytes
	Total table sizes: 11131 bytes (-22 bytes)
2020-03-20	Dynamically choose best chunk size	Mark Rousskov	-119/+95
	Try chunk sizes between 1 and 64, selecting the one which minimizes the number of bytes used. 16, the previous constant, turned out to be a rather good choice, with 5/9 of the datasets still using it.
	Alphabetic     : 3036 bytes (- 19 bytes)
	Case_Ignorable : 2136 bytes
	Cased          : 934 bytes
	Cc             : 32 bytes (- 11 bytes)
	Grapheme_Extend: 1774 bytes
	Lowercase      : 985 bytes
	N              : 1225 bytes (- 41 bytes)
	Uppercase      : 934 bytes
	White_Space    : 97 bytes (- 43 bytes)
	Total table sizes: 11153 bytes (-114 bytes)
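The search described above amounts to evaluating a size estimate for every candidate chunk length and keeping the minimum. The sketch below assumes a simplified cost model (one index byte per chunk plus 8 bytes per word of each distinct chunk); the functions `encoded_size` and `best_chunk_len` are hypothetical and the generator's real accounting differs.

```rust
use std::collections::HashSet;

/// Estimated bytes needed when `words` is split into chunks of `chunk_len`
/// words and identical chunks are stored only once (assumed cost model).
fn encoded_size(words: &[u64], chunk_len: usize) -> usize {
    let mut unique: HashSet<Vec<u64>> = HashSet::new();
    let mut chunk_count = 0;
    for chunk in words.chunks(chunk_len) {
        // Pad the final chunk with zero words so all chunks have equal length.
        let mut padded = chunk.to_vec();
        padded.resize(chunk_len, 0);
        unique.insert(padded);
        chunk_count += 1;
    }
    // One index byte per chunk, plus 8 bytes per word in each distinct chunk.
    chunk_count + unique.len() * chunk_len * 8
}

/// Try every chunk length from 1 to 64 and return the cheapest one.
fn best_chunk_len(words: &[u64]) -> usize {
    (1..=64)
        .min_by_key(|&len| encoded_size(words, len))
        .expect("the range 1..=64 is never empty")
}
```

With a model like this, a fixed constant such as 16 is simply one of the 64 candidates, so the dynamic choice can never do worse than it under the same estimate.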
2020-03-11	Regenerate tables for Unicode 13.0.0	Josh Stone	-445/+462

2020-01-14	Replace old tables with new unicode data	Mark Rousskov	-3179/+2343

2020-01-14	Add support code for new unicode_data module	Mark Rousskov	-5/+49

2019-12-21	Require issue = "none" over issue = "0" in unstable attributes	Ross MacArthur	-1/+1

2019-11-29	Make libcore/unicode/tables.rs compatible with rustfmt	David Tolnay	-29/+36

2019-11-29	Make libcore/unicode/printable.rs compatible with rustfmt	David Tolnay	-4/+11

2019-11-26	Format libcore with rustfmt	David Tolnay	-6/+6
	This commit applies rustfmt with default settings to files in src/libcore that are not involved in any currently open PR to minimize merge conflicts. The list of files involved in open PRs was determined by querying GitHub's GraphQL API with this script: https://gist.github.com/dtolnay/aa9c34993dc051a4f344d1b10e4487e8
	With the list of files from the script in `outstanding_files`, the relevant commands were:
	    $ find src/libcore -name '*.rs' | xargs rustfmt --edition=2018
	    $ rg libcore outstanding_files | xargs git checkout --
	Repeating this process several months apart should get us coverage of most of the rest of libcore.
2019-09-06	it's more pythonic to use 'is not None' in python files	Guanqun Lu	-1/+1

2019-09-04	remove XID and Pattern_White_Space unicode tables from libcore	Aleksey Kladov	-385/+4
	They are only used by rustc_lexer, and are not needed elsewhere. So we move the relevant definitions into rustc_lexer (while the actual unicode data comes from the unicode-xid crate) and make the rest of the compiler use it.
2019-08-05	Make some items in core::unicode private	Matthew Jasper	-20/+20
	They were reachable through opaque macros defined in `core`
2019-07-26	Rollup merge of #62084 - euclio:unicode-table-tweak, r=kennytm	Mazdak Farrokhzad	-2/+2
	allow clippy::unreadable_literal in unicode tables
	Also modifies the generation script to emit 2018 edition paths.
2019-07-12	allow clippy::unreadable_literal in unicode tables	Andy Russell	-4/+4
	Also modifies the generation script to emit 2018 edition paths.
2019-07-12	Regenerate character tables for Unicode 12.1	Josh Stone	-734/+765

2019-07-12	Update unicode scripts for the current coding style	Josh Stone	-5/+5

2019-07-06	Rollup merge of #60081 - pawroman:cleanup_unicode_script, r=varkor	Mazdak Farrokhzad	-352/+740
	Refactor unicode.py script
	Hi, I noticed that the `unicode.py` script used some deprecated escapes in regular expressions. E.g. `\d`, `\w`, `\.` will be illegal in the future without "raw strings". This is now fixed. I have also cleaned up the script quite a bit.
	## Escape deprecation
	OK (note the `r`): `re.compile(r"\d")`
	Deprecated (from Python 3.6 onwards, see [here][link1] and [here][link2]): `re.compile("\d")`.
	[link1]: https://docs.python.org/3.6/whatsnew/3.6.html#deprecated-python-behavior
	[link2]: https://bugs.python.org/issue27364
	This was evident running the script using Python 3.7 like so:
	```
	$ python3 -Wall unicode.py
	unicode.py:227: DeprecationWarning: invalid escape sequence \w
	  re1 = re.compile("^ ([0-9A-F]+) ; (\w+)")
	unicode.py:228: DeprecationWarning: invalid escape sequence \.
	  re2 = re.compile("^ ([0-9A-F]+)\.\.([0-9A-F]+) ; (\w+)")
	unicode.py:453: DeprecationWarning: invalid escape sequence \d
	  pattern = "for Version (\d+)\.(\d+)\.(\d+) of the Unicode"
	```
	The documentation states that
	> A backslash-character pair that is not a valid escape sequence now generates a DeprecationWarning. Although this will eventually become a SyntaxError, that will not be for several Python releases.
	## Testing
	To test my changes, I had to add support for choosing the Unicode version to use. The script will default to latest release (which is 12.0.0 at the moment, repo has 11.0.0 checked in).
	The script generates the exact same output for version 11.0.0 with Python 2.7 and 3.7 and no longer generates any deprecation warnings:
	```
	$ python3 -Wall unicode.py -v 11.0.0
	Using Unicode version: 11.0.0
	Regenerated tables.rs.
	$ git diff tables.rs
	$ python2 -Wall unicode.py -v 11.0.0
	Using Unicode version: 11.0.0
	Regenerated tables.rs.
	$ git diff tables.rs
	$ python2 --version
	Python 2.7.16
	$ python3 --version
	Python 3.7.3
	```
	## Extra functionality
	Furthermore, the script will check and download the latest Unicode version by default (without the `-v` argument).
	The `--help` is below:
	```
	$ ./unicode.py --help
	usage: unicode.py [-h] [-v VERSION]
	Regenerate Unicode tables (tables.rs).
	optional arguments:
	  -h, --help            show this help message and exit
	  -v VERSION, --version VERSION
	                        Unicode version to use (if not specified, defaults to
	                        latest available final release).
	```
	## Cleanups
	I have cleaned up the code quite a bit, with Python best practices and code style in mind. I'm happy to provide more details and rationale for all my changes if the reviewers so desire.
	One externally visible change is that the Unicode data will now be downloaded into `src/libcore/unicode/downloaded` directory suffixed by Unicode version:
	```
	$ pwd
	.../rust/src/libcore/unicode
	$ exa -T downloaded/
	downloaded
	├── 11.0.0
	│   ├── DerivedCoreProperties.txt
	│   ├── DerivedNormalizationProps.txt
	│   ├── PropList.txt
	│   ├── ReadMe.txt
	│   ├── Scripts.txt
	│   ├── SpecialCasing.txt
	│   └── UnicodeData.txt
	└── 12.0.0
	    ├── DerivedCoreProperties.txt
	    ├── DerivedNormalizationProps.txt
	    ├── PropList.txt
	    ├── ReadMe.txt
	    ├── Scripts.txt
	    ├── SpecialCasing.txt
	    └── UnicodeData.txt
	```
2019-07-01	Address review remarks in unicode.py	Paweł Romanowski	-55/+61

2019-06-10	Apply suggestions from code review	Paweł Romanowski	-4/+5
	Co-Authored-By: varkor <github@varkor.com>
2019-04-19	Refactor and document unicode.py script	Paweł Romanowski	-302/+518

2019-04-18	Fix tidy errors	Paweł Romanowski	-2/+3

2019-04-18	More cleanups for unicode.py	Paweł Romanowski	-25/+23

2019-04-18	Clean up unicode.py script	Paweł Romanowski	-103/+269

2019-04-18	libcore => 2018	Taiki Endo	-5/+5

2018-12-25	Remove licenses	Mark Rousskov	-90/+1

2018-12-04	cleanup: remove static lifetimes from consts	ljedrz	-6/+6

2018-11-10	revert making internal APIs const fn.	Mazdak Farrokhzad	-1/+1

2018-11-10	constify parts of libcore.	Mazdak Farrokhzad	-2/+1

2018-08-01	Auto merge of #51609 - dscorbett:is_numeric, r=alexcrichton	bors	-30/+44
	Treat gc=No characters as numeric
	[`char::is_numeric`](https://doc.rust-lang.org/std/primitive.char.html#method.is_numeric) and [`char::is_alphanumeric`](https://doc.rust-lang.org/std/primitive.char.html#method.is_alphanumeric) are documented to be defined “in terms of the Unicode General Categories 'Nd', 'Nl', 'No'”, but unicode.py does not group 'No' with the other 'N' categories. These functions therefore currently return `false` for characters like ⟨¾⟩ and ⟨①⟩.
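The behavior this change is meant to produce can be checked with a short sketch on a toolchain that includes it. The helper `numeric_flags` is a hypothetical name introduced only for illustration; `char::is_numeric` is the standard library method discussed above.

```rust
/// Hypothetical helper: pair each character with the result of `is_numeric`.
fn numeric_flags(chars: &[char]) -> Vec<(char, bool)> {
    chars.iter().map(|&c| (c, c.is_numeric())).collect()
}

/// Sample characters: U+00BE (three quarters) and U+2460 (circled digit one)
/// have general category No; '7' is Nd; 'a' is a letter.
fn sample_chars() -> [char; 4] {
    ['\u{00BE}', '\u{2460}', '7', 'a']
}
```

Calling `numeric_flags(&sample_chars())` should report the first three characters as numeric and the letter as not numeric once 'No' is grouped with the other 'N' categories.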
2018-07-06	Handle array manually in string case conversion methods	Pazzaz	-0/+3

2018-06-17	Treat gc=No characters as numeric	David Corbett	-30/+44

2018-06-11	Regenerate character tables for Unicode 11	Josh Stone	-1120/+1214

2018-05-21	Fix tables.rs	varkor	-6/+45

2018-05-21	Avoid counting characters and add explanatory comment to test	varkor	-1/+1

2018-05-21	Use Grapheme_Extend instead of Mn	varkor	-166/+129

2018-05-21	Use the correct output directory for downloading Unicode files	varkor	-2/+1

2018-05-21	Escape combining characters in escape_debug	varkor	-1/+1

2018-05-21	Keep tables.rs copyright notice up to date	varkor	-5/+5

2018-05-21	Download unicode data files in directory of unicode.py	varkor	-7/+11