Skip to content

perf/factor ~ deduplicate divisors#1571

Closed
nicoonoclaste wants to merge 19 commits into
uutils:masterfrom
nicoonoclaste:factor/faster/dedup_primes
Closed

perf/factor ~ deduplicate divisors#1571
nicoonoclaste wants to merge 19 commits into
uutils:masterfrom
nicoonoclaste:factor/faster/dedup_primes

Conversation

@nicoonoclaste
Copy link
Copy Markdown
Contributor

  • Refactor (eheh) the factor() function and Factors datastructure
  • Introduce Decomposition, a representation of divisor multisets, common to factor() and Factors.
  • Optimise Decomposition, first by switching to a flat vector representation, then by using smallvec to stack-allocate the space in most cases.
  • Optimise the fmt implementation for Factors, avoiding a data copy to sort the factors.
  • Use a GCD-powered algorithm to decompose numbers into coprime factors, whenever a new factor is found.
    This is marked WiP, as the implementation is rather ugly and leaves some performance on the table.
  • Optimise Decomposition again, knowing we do not ever add the same factor twice (so we can skip looking for it)

@nicoonoclaste
Copy link
Copy Markdown
Contributor Author

Expected performance gain is ⪆20%, but I am missing precise measurements: the only computer I can use for this is currently my laptop, and as the weather got rather warm here, I'm having issues benchmarking precisely due to thermal throttling.

Moreover, the performance figures in the commit messages are likely wrong and need to be measured again, as I reordered the commits.

@nicoonoclaste
Copy link
Copy Markdown
Contributor Author

Thinking further about it, I should submit everything up to 25176ee (i.e. using a flat vector, eliding most heap allocations, etc.) as a separate PR that doesn't involve heavy maths or algorithmics.

That should be easier to review, if I split it in a few PRs with narrower scope, and it means it can be reviewed/merged while I'm still working on this. :)

@nicoonoclaste
Copy link
Copy Markdown
Contributor Author

Rebased on top of #1572 after those commits got extracted.

@nicoonoclaste
Copy link
Copy Markdown
Contributor Author

Unrelated CI failure:

failures:
    test_head::test_bug_in_negative_zero_lines
    test_head::test_unsupported_negative_byte_syntax
    test_head::test_unsupported_zero_terminated_syntax

@nicoonoclaste
Copy link
Copy Markdown
Contributor Author

Test failure should be fixed by #1586.

@rivy rivy self-assigned this Sep 15, 2020
@rivy
Copy link
Copy Markdown
Contributor

rivy commented Oct 6, 2020

Rebased to 'master'.
Tests fail occasionally at 'test_factor::test_random' and 'test_factor::test_random_big'.
Needs bug triage and repair before merge.

@rivy
Copy link
Copy Markdown
Contributor

rivy commented Oct 10, 2020

newer changes

  • bug has been tracked down and fixed (including relevent additional testing)
  • minor performance enhancement on top of @nbraud's changes (~5% speed improvement)
  • minor refactoring for comprehension/readability

Overall, this looks to be ready for merge and narrows the performance gap with GNU factor. With this PR, this implementation now takes about 3-4 times as much time as GNU factor when factoring large sets of numbers. For smaller (usual?) sets, setup and I/O are more dominant and the two executables are perceptually equal.

$ ## 10,000,001 factorizations
$ hyperfine -L exe "factor,../../../target/release/factor,../../../target/release/coreutils factor" "seq 0 $((10 ** 7)) | {exe} > /dev/null"
Benchmark #1: seq 0 10000000 | factor > /dev/null
  Time (mean ± σ):      2.974 s ±  0.088 s    [User: 2.364 s, System: 0.809 s]
  Range (min … max):    2.881 s …  3.117 s    10 runs

Benchmark #2: seq 0 10000000 | ../../../target/release/factor > /dev/null
  Time (mean ± σ):     10.221 s ±  0.152 s    [User: 10.139 s, System: 0.292 s]
  Range (min … max):    9.901 s … 10.331 s    10 runs

Benchmark #3: seq 0 10000000 | ../../../target/release/coreutils factor > /dev/null
  Time (mean ± σ):     10.585 s ±  0.141 s    [User: 10.580 s, System: 0.204 s]
  Range (min … max):   10.254 s … 10.693 s    10 runs

Summary
  'seq 0 10000000 | factor > /dev/null' ran
    3.44 ± 0.11 times faster than 'seq 0 10000000 | ../../../target/release/factor > /dev/null'
    3.56 ± 0.12 times faster than 'seq 0 10000000 | ../../../target/release/coreutils factor > /dev/null'
$ ## 101 factorizations
$ hyperfine -L exe "factor,../../../target/release/factor,../../../target/release/coreutils factor" "seq $((10 ** 7)) $((10 ** 5)) $((2 * (10 ** 7))) | {exe} > /dev/null"
Benchmark #1: seq 10000000 100000 20000000 | factor > /dev/null
  Time (mean ± σ):       4.7 ms ±   0.5 ms    [User: 0.8 ms, System: 6.9 ms]
  Range (min … max):     3.7 ms …   6.6 ms    399 runs

  Warning: Command took less than 5 ms to complete. Results might be inaccurate.

Benchmark #2: seq 10000000 100000 20000000 | ../../../target/release/factor > /dev/null
  Time (mean ± σ):       6.8 ms ±   0.6 ms    [User: 1.0 ms, System: 9.3 ms]
  Range (min … max):     5.7 ms …   9.4 ms    279 runs

Benchmark #3: seq 10000000 100000 20000000 | ../../../target/release/coreutils factor > /dev/null
  Time (mean ± σ):       7.5 ms ±   0.6 ms    [User: 1.1 ms, System: 9.8 ms]
  Range (min … max):     6.3 ms …   9.8 ms    283 runs

Summary
  'seq 10000000 100000 20000000 | factor > /dev/null' ran
    1.45 ± 0.21 times faster than 'seq 10000000 100000 20000000 | ../../../target/release/factor > /dev/null'
    1.60 ± 0.22 times faster than 'seq 10000000 100000 20000000 | ../../../target/release/coreutils factor > /dev/null'

Thanks to @nbraud for the initial heavy lifting!

There are some unrelated new clippy warnings that I'll fix in a subsequent PR (see #1603).

@rivy rivy marked this pull request as ready for review October 10, 2020 21:21
@rivy rivy requested a review from Arcterus October 10, 2020 21:22
nicoonoclaste and others added 15 commits October 10, 2020 20:35
This way, we can easily replace u8 with a larger type when moving to support
larger integers.
The new type can be used to represent in-progress factorisations,
which contain non-prime factors.
~18% faster than BTreeMap, and ~5% faster than 'master'
~2.9% faster than the previous commit, ~11% faster than “master” overall.
~7% slowdown, paves the way for upcoming improvements
~17% faster, many optimisation opportunities still missed  >:)
The invariant is checked by a debug_assert!, and follows from the previous
commit, as `dec` and `factors` only ever contains coprime numbers:
  - true at the start: factor = ∅ and dec = { n¹ } ;
  - on every loop iteration, we pull out an element `f` from `dec` and either:
    - discover it is prime, and add it to `factors` ;
    - split it into a number of coprime factors, that get reinserted into `dec`;
      the invariant is maintained, as all divisors of `f` are coprime with all
      numbers in `dec` and `factors` (as `f` itself is coprime with them.

As we only add elements to `Decomposition` objects that are coprime with the
existing ones, they are distinct.
This avoids allocating on the heap when factoring most numbers,
without using much space on the stack.

This is ~3.5% faster than the previous commit, and ~8.3% faster than “master”.
@rivy rivy requested a review from a team October 25, 2020 05:30
@rivy
Copy link
Copy Markdown
Contributor

rivy commented Oct 25, 2020

@uutils/maintainers , if there are no objections, I'll merge this on Monday.

@rivy rivy changed the title factor: Deduplicate divisors performance/factor ~ deduplicate divisors Oct 26, 2020
@rivy rivy changed the title performance/factor ~ deduplicate divisors perf/factor ~ deduplicate divisors Oct 26, 2020
rivy added a commit that referenced this pull request Oct 26, 2020
perf/factor ~ deduplicate divisors
@rivy
Copy link
Copy Markdown
Contributor

rivy commented Oct 27, 2020

Closed via merge commit effb94b.

@rivy rivy closed this Oct 27, 2020
sylvestre pushed a commit that referenced this pull request Mar 20, 2021
It was a draft PR, not ready for merging, and its premature inclusion
caused repeated issues, see 368f473 & friends.

Close #1841.

This reverts commits 3743a3e,
                     ce218e0, and
                     b7b0c76.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants