| Age | Commit message (Collapse) | Author | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Remove duplicated code from Iterator::{ne, lt, le, gt, ge}
This PR delegates `Iterator::ne` to `Iterator::eq` and `Iterator::{lt, le, gt, ge}` to `Iterator::partial_cmp`.
Oddly enough, this change actually simplifies the generated assembly [in some cases](https://rust.godbolt.org/z/riBtNe), although I don't understand assembly well enough to see if the longer assembly is doing something clever.
I also added two extremely simple benchmarks:
```
// before
test iter::bench_lt ... bench: 98,404 ns/iter (+/- 21,008)
test iter::bench_partial_cmp ... bench: 62,437 ns/iter (+/- 5,009)
// after
test iter::bench_lt ... bench: 61,757 ns/iter (+/- 8,770)
test iter::bench_partial_cmp ... bench: 62,151 ns/iter (+/- 13,753)
```
I have no idea why the current `lt`/`le`/`gt`/`ge` implementations don't seem to be compiled optimally, but simply having them call `partial_cmp` seems to be an improvement.
See #44729 for a previous discussion.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
name old ns/iter new ns/iter diff ns/iter diff % speedup
iter::bench_cycle_take_ref_sum 927,152 927,194 42 0.00% x 1.00
iter::bench_cycle_take_sum 938,129 603,492 -334,637 -35.67% x 1.55
|
|
|
|
|
|
|
|
This permits easier iteration without having to worry about warnings
being denied.
Fixes #49517
|
|
Merge `feature(advanced_slice_patterns)` into `feature(slice_patterns)`
|
|
Makes the bench asked about on URLO 58x faster :)
|
|
Short-circuiting internal iteration with Iterator::try_fold & try_rfold
These are the core methods in terms of which the other methods (`fold`, `all`, `any`, `find`, `position`, `nth`, ...) can be implemented, allowing Iterator implementors to get the full goodness of internal iteration by only overriding one method (per direction).
Based off the `Try` trait, so works with both `Result` and `Option` (:tada: https://github.com/rust-lang/rust/pull/42526). The `try_fold` rustdoc examples use `Option` and the `try_rfold` ones use `Result`.
AKA continuing in the vein of PRs https://github.com/rust-lang/rust/pull/44682 & https://github.com/rust-lang/rust/pull/44856 for more of `Iterator`.
New bench following the pattern from the latter of those:
```
test iter::bench_take_while_chain_ref_sum ... bench: 1,130,843 ns/iter (+/- 25,110)
test iter::bench_take_while_chain_sum ... bench: 362,530 ns/iter (+/- 391)
```
I also ran the benches without the `fold` & `rfold` overrides to test their new default impls, with basically no change. I left them there, though, to take advantage of existing overrides and because `AlwaysOk` has some sub-optimality due to https://github.com/rust-lang/rust/issues/43278 (which 45225 should fix).
If you're wondering why there are three type parameters, see issue https://github.com/rust-lang/rust/issues/45462
Thanks for @bluss for the [original IRLO thread](https://internals.rust-lang.org/t/pre-rfc-fold-ok-is-composable-internal-iteration/4434) and the rfold PR and to @cuviper for adding so many folds, [encouraging me](https://github.com/rust-lang/rust/pull/45379#issuecomment-339424670) to make this PR, and finding a catastrophic bug in a pre-review.
|
|
unpredictable conditional branches in the loop. In addition improve the
benchmarks to test performance in l1, l2 and l3 caches on sorted arrays
with or without dups.
Before:
```
test slice::binary_search_l1 ... bench: 48 ns/iter (+/- 1)
test slice::binary_search_l2 ... bench: 63 ns/iter (+/- 0)
test slice::binary_search_l3 ... bench: 152 ns/iter (+/- 12)
test slice::binary_search_l1_with_dups ... bench: 36 ns/iter (+/- 0)
test slice::binary_search_l2_with_dups ... bench: 64 ns/iter (+/- 1)
test slice::binary_search_l3_with_dups ... bench: 153 ns/iter (+/- 6)
```
After:
```
test slice::binary_search_l1 ... bench: 15 ns/iter (+/- 0)
test slice::binary_search_l2 ... bench: 23 ns/iter (+/- 0)
test slice::binary_search_l3 ... bench: 100 ns/iter (+/- 17)
test slice::binary_search_l1_with_dups ... bench: 15 ns/iter (+/- 0)
test slice::binary_search_l2_with_dups ... bench: 23 ns/iter (+/- 0)
test slice::binary_search_l3_with_dups ... bench: 98 ns/iter (+/- 14)
```
|
|
This is the core method in terms of which the other methods (fold, all, any, find, position, nth, ...) can be implemented, allowing Iterator implementors to get the full goodness of internal iteration by only overriding one method (per direction).
|
|
|
|
remove FIXME(#13101) since `assert_receiver_is_total_eq` stays.
remove FIXME(#19649) now that stability markers render.
remove FIXME(#13642) now the benchmarks were moved.
remove FIXME(#6220) now that floating points can be formatted.
remove FIXME(#18248) and write tests for `Rc<str>` and `Rc<[u8]>`
remove reference to irelevent issues in FIXME(#1697, #2178...)
update FIXME(#5516) to point to getopts issue 7
update FIXME(#7771) to point to RFC 628
update FIXME(#19839) to point to issue 26925
|
|
Many of the iterator adaptors will perform faster folds if they forward
to their inner iterator's folds, especially for inner types like `Chain`
which are optimized too. The following types are newly specialized:
| Type | `fold` | `rfold` |
| ----------- | ------ | ------- |
| `Enumerate` | ✓ | ✓ |
| `Filter` | ✓ | ✓ |
| `FilterMap` | ✓ | ✓ |
| `FlatMap` | exists | ✓ |
| `Fuse` | ✓ | ✓ |
| `Inspect` | ✓ | ✓ |
| `Peekable` | ✓ | N/A¹ |
| `Skip` | ✓ | N/A² |
| `SkipWhile` | ✓ | N/A¹ |
¹ not a `DoubleEndedIterator`
² `Skip::next_back` doesn't pull skipped items at all, but this couldn't
be avoided if `Skip::rfold` were to call its inner iterator's `rfold`.
Benchmarks
----------
In the following results, plain `_sum` computes the sum of a million
integers -- note that `sum()` is implemented with `fold()`. The
`_ref_sum` variants do the same on a `by_ref()` iterator, which is
limited to calling `next()` one by one, without specialized `fold`.
The `chain` variants perform the same tests on two iterators chained
together, to show a greater benefit of forwarding `fold` internally.
test iter::bench_enumerate_chain_ref_sum ... bench: 2,216,264 ns/iter (+/- 29,228)
test iter::bench_enumerate_chain_sum ... bench: 922,380 ns/iter (+/- 2,676)
test iter::bench_enumerate_ref_sum ... bench: 476,094 ns/iter (+/- 7,110)
test iter::bench_enumerate_sum ... bench: 476,438 ns/iter (+/- 3,334)
test iter::bench_filter_chain_ref_sum ... bench: 2,266,095 ns/iter (+/- 6,051)
test iter::bench_filter_chain_sum ... bench: 745,594 ns/iter (+/- 2,013)
test iter::bench_filter_ref_sum ... bench: 889,696 ns/iter (+/- 1,188)
test iter::bench_filter_sum ... bench: 667,325 ns/iter (+/- 1,894)
test iter::bench_filter_map_chain_ref_sum ... bench: 2,259,195 ns/iter (+/- 353,440)
test iter::bench_filter_map_chain_sum ... bench: 1,223,280 ns/iter (+/- 1,972)
test iter::bench_filter_map_ref_sum ... bench: 611,607 ns/iter (+/- 2,507)
test iter::bench_filter_map_sum ... bench: 611,610 ns/iter (+/- 472)
test iter::bench_fuse_chain_ref_sum ... bench: 2,246,106 ns/iter (+/- 22,395)
test iter::bench_fuse_chain_sum ... bench: 634,887 ns/iter (+/- 1,341)
test iter::bench_fuse_ref_sum ... bench: 444,816 ns/iter (+/- 1,748)
test iter::bench_fuse_sum ... bench: 316,954 ns/iter (+/- 2,616)
test iter::bench_inspect_chain_ref_sum ... bench: 2,245,431 ns/iter (+/- 21,371)
test iter::bench_inspect_chain_sum ... bench: 631,645 ns/iter (+/- 4,928)
test iter::bench_inspect_ref_sum ... bench: 317,437 ns/iter (+/- 702)
test iter::bench_inspect_sum ... bench: 315,942 ns/iter (+/- 4,320)
test iter::bench_peekable_chain_ref_sum ... bench: 2,243,585 ns/iter (+/- 12,186)
test iter::bench_peekable_chain_sum ... bench: 634,848 ns/iter (+/- 1,712)
test iter::bench_peekable_ref_sum ... bench: 444,808 ns/iter (+/- 480)
test iter::bench_peekable_sum ... bench: 317,133 ns/iter (+/- 3,309)
test iter::bench_skip_chain_ref_sum ... bench: 1,778,734 ns/iter (+/- 2,198)
test iter::bench_skip_chain_sum ... bench: 761,850 ns/iter (+/- 1,645)
test iter::bench_skip_ref_sum ... bench: 478,207 ns/iter (+/- 119,252)
test iter::bench_skip_sum ... bench: 315,614 ns/iter (+/- 3,054)
test iter::bench_skip_while_chain_ref_sum ... bench: 2,486,370 ns/iter (+/- 4,845)
test iter::bench_skip_while_chain_sum ... bench: 633,915 ns/iter (+/- 5,892)
test iter::bench_skip_while_ref_sum ... bench: 666,926 ns/iter (+/- 804)
test iter::bench_skip_while_sum ... bench: 444,405 ns/iter (+/- 571)
|
|
`FlatMap` can use internal iteration for its `fold`, which shows a
performance advantage in the new benchmarks:
test iter::bench_flat_map_chain_ref_sum ... bench: 4,354,111 ns/iter (+/- 108,871)
test iter::bench_flat_map_chain_sum ... bench: 468,167 ns/iter (+/- 2,274)
test iter::bench_flat_map_ref_sum ... bench: 449,616 ns/iter (+/- 6,257)
test iter::bench_flat_map_sum ... bench: 348,010 ns/iter (+/- 1,227)
... where the "ref" benches are using `by_ref()` that isn't optimized.
So this change shows a decent advantage on its own, but much more when
combined with a `chain` iterator that also optimizes `fold`.
|
|
The benefit of using internal iteration is shown in new benchmarks:
test iter::bench_for_each_chain_fold ... bench: 635,110 ns/iter (+/- 5,135)
test iter::bench_for_each_chain_loop ... bench: 2,249,983 ns/iter (+/- 42,001)
test iter::bench_for_each_chain_ref_fold ... bench: 2,248,061 ns/iter (+/- 51,940)
|
|
We have benchmarks for the floating-point formatting algorithms
themselves, but not for the surrounding machinery like Formatter and
translating to the flt2dec::Part slices.
|
|
And libcore/benches
|