perf/factor ~ deduplicate divisors #1571
Conversation
Expected performance gain is ⪆20%, but I am missing precise measurements: the only computer I can currently use for this is my laptop, and as the weather got rather warm here, I'm having trouble benchmarking precisely due to thermal throttling. Moreover, the performance figures in the commit messages are likely wrong and need to be measured again, as I reordered the commits.
Thinking further about it, I should submit everything up to 25176ee (i.e. using a flat vector, eliding most heap allocations, etc.) as a separate PR that doesn't involve heavy maths or algorithmics. Splitting this into a few PRs with narrower scope should make it easier to review, and it means those parts can be reviewed/merged while I'm still working on this. :)
Rebased on top of #1572 after those commits got extracted.
Unrelated CI failure:
Test failure should be fixed by #1586.
Rebased onto 'master'.
Overall, this looks ready for merge and narrows the performance gap with GNU.

## 10,000,001 factorizations

```console
$ hyperfine -L exe "factor,../../../target/release/factor,../../../target/release/coreutils factor" "seq 0 $((10 ** 7)) | {exe} > /dev/null"
Benchmark #1: seq 0 10000000 | factor > /dev/null
  Time (mean ± σ):      2.974 s ±  0.088 s    [User: 2.364 s, System: 0.809 s]
  Range (min … max):    2.881 s …  3.117 s    10 runs

Benchmark #2: seq 0 10000000 | ../../../target/release/factor > /dev/null
  Time (mean ± σ):     10.221 s ±  0.152 s    [User: 10.139 s, System: 0.292 s]
  Range (min … max):    9.901 s … 10.331 s    10 runs

Benchmark #3: seq 0 10000000 | ../../../target/release/coreutils factor > /dev/null
  Time (mean ± σ):     10.585 s ±  0.141 s    [User: 10.580 s, System: 0.204 s]
  Range (min … max):   10.254 s … 10.693 s    10 runs

Summary
  'seq 0 10000000 | factor > /dev/null' ran
    3.44 ± 0.11 times faster than 'seq 0 10000000 | ../../../target/release/factor > /dev/null'
    3.56 ± 0.12 times faster than 'seq 0 10000000 | ../../../target/release/coreutils factor > /dev/null'
```

## 101 factorizations

```console
$ hyperfine -L exe "factor,../../../target/release/factor,../../../target/release/coreutils factor" "seq $((10 ** 7)) $((10 ** 5)) $((2 * (10 ** 7))) | {exe} > /dev/null"
Benchmark #1: seq 10000000 100000 20000000 | factor > /dev/null
  Time (mean ± σ):       4.7 ms ±   0.5 ms    [User: 0.8 ms, System: 6.9 ms]
  Range (min … max):     3.7 ms …   6.6 ms    399 runs

  Warning: Command took less than 5 ms to complete. Results might be inaccurate.

Benchmark #2: seq 10000000 100000 20000000 | ../../../target/release/factor > /dev/null
  Time (mean ± σ):       6.8 ms ±   0.6 ms    [User: 1.0 ms, System: 9.3 ms]
  Range (min … max):     5.7 ms …   9.4 ms    279 runs

Benchmark #3: seq 10000000 100000 20000000 | ../../../target/release/coreutils factor > /dev/null
  Time (mean ± σ):       7.5 ms ±   0.6 ms    [User: 1.1 ms, System: 9.8 ms]
  Range (min … max):     6.3 ms …   9.8 ms    283 runs

Summary
  'seq 10000000 100000 20000000 | factor > /dev/null' ran
    1.45 ± 0.21 times faster than 'seq 10000000 100000 20000000 | ../../../target/release/factor > /dev/null'
    1.60 ± 0.22 times faster than 'seq 10000000 100000 20000000 | ../../../target/release/coreutils factor > /dev/null'
```

Thanks to @nbraud for the initial heavy lifting! There are some unrelated new clippy warnings that I'll fix in a subsequent PR (see #1603).
This way, we can easily replace `u8` with a larger type when moving to support larger integers.
The new type can be used to represent in-progress factorisations, which contain non-prime factors.
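A minimal sketch of what such a type could look like (names and layout are illustrative, not the PR's exact code): the exponent is a type alias over `u8` so it can be widened later in one place, and the multiset may hold non-prime factors while a factorisation is in progress.

```rust
/// Exponent type; a type alias makes it easy to widen later (e.g. to u32)
/// when moving to support larger integers.
type Exponent = u8;

/// A multiset of divisors, stored as (factor, exponent) pairs.
/// Unlike a finished factorisation, it may contain non-prime factors.
#[derive(Debug, Default, PartialEq)]
struct Decomposition(Vec<(u64, Exponent)>);

impl Decomposition {
    /// Record `factor`, merging with an existing entry if one is present.
    fn add(&mut self, factor: u64, exp: Exponent) {
        match self.0.iter_mut().find(|(f, _)| *f == factor) {
            Some((_, e)) => *e += exp,
            None => self.0.push((factor, exp)),
        }
    }
}

fn main() {
    let mut d = Decomposition::default();
    d.add(2, 3);
    d.add(15, 1); // a non-prime factor: fine for an in-progress factorisation
    d.add(2, 1); // merges with the existing entry for 2
    assert_eq!(d.0, vec![(2, 4), (15, 1)]);
    println!("{:?}", d);
}
```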
~18% faster than BTreeMap, and ~5% faster than 'master'
~2.9% faster than the previous commit, ~11% faster than “master” overall.
~7% slowdown, paves the way for upcoming improvements
~17% faster, many optimisation opportunities still missed >:)
The invariant is checked by a `debug_assert!`, and follows from the previous
commit, as `dec` and `factors` only ever contain coprime numbers:
- true at the start: factors = ∅ and dec = { n¹ };
- on every loop iteration, we pull an element `f` out of `dec` and either:
  - discover it is prime, and add it to `factors`;
  - split it into a number of coprime factors, which get reinserted into `dec`;
    the invariant is maintained, as all divisors of `f` are coprime with all
    numbers in `dec` and `factors` (since `f` itself is coprime with them).
As we only add elements to `Decomposition` objects that are coprime with the
existing ones, they are distinct.
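The coprimality invariant can be expressed as a checkable predicate along these lines (a sketch with illustrative helper names, not the PR's code); note that pairwise-coprime numbers greater than 1 are automatically distinct, which is why coprimality implies no duplicates.

```rust
/// Euclid's algorithm for the greatest common divisor.
fn gcd(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let r = a % b;
        a = b;
        b = r;
    }
    a
}

/// Check that every pair of numbers in `values` is coprime; for values > 1
/// this also implies the elements are pairwise distinct.
fn pairwise_coprime(values: &[u64]) -> bool {
    values
        .iter()
        .enumerate()
        .all(|(i, &a)| values[i + 1..].iter().all(|&b| gcd(a, b) == 1))
}

fn main() {
    // e.g. splitting f = 105 into coprime factors 3, 5, 7 keeps the invariant:
    assert!(pairwise_coprime(&[3, 5, 7]));
    // whereas 6 and 15 share the factor 3:
    assert!(!pairwise_coprime(&[6, 15]));
}
```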
This avoids allocating on the heap when factoring most numbers, without using much space on the stack. This is ~3.5% faster than the previous commit, and ~8.3% faster than “master”.
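For context on why a small inline buffer can cover most cases: a `u64` has at most 15 distinct prime factors, since the product of the first 16 primes already overflows 64 bits. This stdlib-only check verifies that bound (illustrative; the inline capacity the PR actually picks is not shown here):

```rust
/// Count how many of the first primes can be multiplied together before the
/// product overflows a u64; this bounds the number of distinct prime factors
/// any u64 can have, and hence the entries a factorisation needs to store.
fn max_distinct_prime_factors() -> usize {
    let primes: [u64; 16] = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53];
    let mut product: u128 = 1;
    let mut count = 0;
    for &p in &primes {
        product *= u128::from(p);
        if product > u128::from(u64::MAX) {
            break;
        }
        count += 1;
    }
    count
}

fn main() {
    // The product of the first 15 primes (2·3·…·47) fits in a u64,
    // but multiplying by the 16th prime (53) overflows.
    assert_eq!(max_distinct_prime_factors(), 15);
    println!("max distinct prime factors of a u64: {}", max_distinct_prime_factors());
}
```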
@uutils/maintainers, if there are no objections, I'll merge this on Monday.
Closed via merge commit effb94b.
This PR deduplicates divisor handling between the `factor()` function and the `Factors` data structure:

- introduce `Decomposition`, a representation of divisor multisets, common to `factor()` and `Factors`;
- optimise `Decomposition`, first by switching to a flat vector representation, then by using `smallvec` to stack-allocate the space in most cases;
- rework the `fmt` implementation for `Factors`, avoiding a data copy to sort the factors.

This is marked WiP, as the implementation is rather ugly and leaves some performance on the table.
Optimise `Decomposition` again, knowing we do not ever add the same factor twice (so we can skip looking for it).
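Because the coprimality invariant guarantees a newly split factor is not already present, insertion can skip the linear search over existing entries and push unconditionally, keeping the check only as a debug assertion. A sketch with hypothetical names, not the PR's exact code:

```rust
type Exponent = u8;

#[derive(Debug, Default, PartialEq)]
struct Decomposition(Vec<(u64, Exponent)>);

impl Decomposition {
    /// Insert a factor known (by the coprimality invariant) to be absent:
    /// no `find` over existing entries, just a push. The invariant is only
    /// checked in debug builds.
    fn push_new(&mut self, factor: u64, exp: Exponent) {
        debug_assert!(self.0.iter().all(|&(f, _)| f != factor));
        self.0.push((factor, exp));
    }
}

fn main() {
    let mut d = Decomposition::default();
    d.push_new(3, 2);
    d.push_new(5, 1); // coprime with 3, so guaranteed not already present
    assert_eq!(d.0, vec![(3, 2), (5, 1)]);
    println!("{:?}", d);
}
```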