| author | bors <bors@rust-lang.org> | 2014-11-28 15:11:24 +0000 |
|---|---|---|
| committer | bors <bors@rust-lang.org> | 2014-11-28 15:11:24 +0000 |
| commit | 29e928f2ba3501d37660314f6186d0e2ac18b9db | |
| tree | 0494b6be13765bfa725addca9eb8d8645d25478e | |
| parent | f33d879a7094bce7e16345dcc2efa85da6f05261 | |
| parent | 4d1cb7820de50899f2009da20f83b639df2873f0 | |
auto merge of #19345 : steveklabnik/rust/gh19344, r=alexcrichton
Fixes #19344
| -rw-r--r-- | src/doc/complement-lang-faq.md | 2 |
1 files changed, 1 insertions, 1 deletions
diff --git a/src/doc/complement-lang-faq.md b/src/doc/complement-lang-faq.md
index 0a8f9b2ffaa..62faecede55 100644
--- a/src/doc/complement-lang-faq.md
+++ b/src/doc/complement-lang-faq.md
@@ -108,7 +108,7 @@ The `str` type is UTF-8 because we observe more text in the wild in this encodin
 This does mean that indexed access to a Unicode codepoint inside a `str` value is an O(n) operation. On the one hand, this is clearly undesirable; on the other hand, this problem is full of trade-offs and we'd like to point a few important qualifications:
-* Scanning a `str` for ASCII-range codepoints can still be done safely octet-at-a-time, with each indexing operation pulling out a `u8` costing only O(1) and producing a value that can be cast and compared to an ASCII-range `char`. So if you're (say) line-breaking on `'\n'`, octet-based treatment still works. UTF8 was well-designed this way.
+* Scanning a `str` for ASCII-range codepoints can still be done safely octet-at-a-time. If you use `.as_bytes()`, pulling out a `u8` costs only O(1) and produces a value that can be cast and compared to an ASCII-range `char`. So if you're (say) line-breaking on `'\n'`, octet-based treatment still works. UTF8 was well-designed this way.
 * Most "character oriented" operations on text only work under very restricted language assumptions sets such as "ASCII-range codepoints only". Outside ASCII-range, you tend to have to use a complex (non-constant-time) algorithm for determining linguistic-unit (glyph, word, paragraph) boundaries anyways. We recommend using an "honest" linguistically-aware, Unicode-approved algorithm.
 * The `char` type is UCS4. If you honestly need to do a codepoint-at-a-time algorithm, it's trivial to write a `type wstr = [char]`, and unpack a `str` into it in a single pass, then work with the `wstr`.
 In other words: the fact that the language is not "decoding to UCS4 by default" shouldn't stop you from decoding (or re-encoding any other way) if you need to work with that encoding.
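
For illustration only (not part of this commit): a minimal sketch of the two techniques the changed FAQ entry describes, written against present-day Rust rather than the 2014 syntax this patch targets. The function names `newline_offsets` and `to_code_points` are hypothetical, and `Vec<char>` stands in for the `type wstr = [char]` the FAQ mentions. The byte scan is safe because every byte of a multi-byte UTF-8 sequence has its high bit set, so the octet `0x0A` can only ever be a real `'\n'`.

```rust
/// Byte offsets of every '\n' in `text`, found by scanning octets.
/// Each access into the slice returned by `as_bytes()` is O(1).
fn newline_offsets(text: &str) -> Vec<usize> {
    text.as_bytes()
        .iter()
        .enumerate()
        // Compare each raw byte against the ASCII newline octet.
        .filter(|&(_, &byte)| byte == b'\n')
        .map(|(offset, _)| offset)
        .collect()
}

/// Decode `text` in a single O(n) pass into one `char` per code point,
/// so that later indexing by code point is O(1).
fn to_code_points(text: &str) -> Vec<char> {
    text.chars().collect()
}
```

Note that `newline_offsets` reports byte offsets, not code point indices: for `"héllo\n"` the newline sits at byte 6 because `é` occupies two bytes. Code that needs code point positions would index the result of `to_code_points` instead.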
