
join: improve performance #3092

Merged: sylvestre merged 7 commits into uutils:main from jtracey:join-performance on Feb 10, 2022
Conversation

@jtracey (Contributor) commented on Feb 7, 2022

This adds a series of performance improvements to join:

  • Lock and buffer stdout, avoiding repeated lock acquisition in our single-threaded use case and the overhead of many tiny writes, such as single newlines.
  • Use faster techniques for finding and keeping track of fields.
  • Make an intelligent guess of how many fields a line will have, rather than repeatedly growing the field vector for every single line.

See the following benchmarks (each of the join executables has the cumulative improvements up to that commit, not just the respective one):

$ hyperfine -w 5 \
 "~/Documents/coreutils/join.main -i -t, -j14 kitch_bench.csv wr_bench.csv" \
 "~/Documents/coreutils/join.buffered -i -t, -j14 kitch_bench.csv wr_bench.csv" \
 "~/Documents/coreutils/join.field_parsing -i -t, -j14 kitch_bench.csv wr_bench.csv" \
 "~/Documents/coreutils/join.field_guess -i -t, -j14 kitch_bench.csv wr_bench.csv" \
 "join -i -t, -j14 kitch_bench.csv wr_bench.csv"
Benchmark 1: ~/Documents/coreutils/join.main -i -t, -j14 kitch_bench.csv wr_bench.csv
  Time (mean ± σ):     701.5 ms ±   8.5 ms    [User: 658.5 ms, System: 42.9 ms]
  Range (min … max):   690.4 ms … 716.4 ms    10 runs
 
Benchmark 2: ~/Documents/coreutils/join.buffered -i -t, -j14 kitch_bench.csv wr_bench.csv
  Time (mean ± σ):     471.3 ms ±  12.7 ms    [User: 444.6 ms, System: 26.6 ms]
  Range (min … max):   457.1 ms … 500.9 ms    10 runs
 
Benchmark 3: ~/Documents/coreutils/join.field_parsing -i -t, -j14 kitch_bench.csv wr_bench.csv
  Time (mean ± σ):     257.2 ms ±   4.4 ms    [User: 227.4 ms, System: 29.7 ms]
  Range (min … max):   252.2 ms … 266.7 ms    11 runs
 
Benchmark 4: ~/Documents/coreutils/join.field_guess -i -t, -j14 kitch_bench.csv wr_bench.csv
  Time (mean ± σ):     194.8 ms ±   4.1 ms    [User: 170.4 ms, System: 24.6 ms]
  Range (min … max):   188.3 ms … 202.9 ms    15 runs
 
Benchmark 5: join -i -t, -j14 kitch_bench.csv wr_bench.csv
  Time (mean ± σ):     240.0 ms ±   5.5 ms    [User: 212.6 ms, System: 27.5 ms]
  Range (min … max):   233.7 ms … 251.4 ms    12 runs
 
Summary
  '~/Documents/coreutils/join.field_guess -i -t, -j14 kitch_bench.csv wr_bench.csv' ran
    1.23 ± 0.04 times faster than 'join -i -t, -j14 kitch_bench.csv wr_bench.csv'
    1.32 ± 0.04 times faster than '~/Documents/coreutils/join.field_parsing -i -t, -j14 kitch_bench.csv wr_bench.csv'
    2.42 ± 0.08 times faster than '~/Documents/coreutils/join.buffered -i -t, -j14 kitch_bench.csv wr_bench.csv'
    3.60 ± 0.09 times faster than '~/Documents/coreutils/join.main -i -t, -j14 kitch_bench.csv wr_bench.csv'

The GNU join tested is from GNU coreutils 8.32, the default package shipped in Ubuntu 21.10. The kitch_bench.csv and wr_bench.csv files are made of public address data, with some minor post-processing (IIRC, just sorting the generated originals, but I could have forgotten another step or two). They are 14 MiB and 38 MiB, respectively, with many, but not all, records in common. As you can see, all told, we're now 3.6x faster than the current main branch, and 1.2x faster than GNU.

Also included is a commit adding some minor improvements to error handling (necessary to allow flushing stdout if we're about to terminate from unordered lines; without this, the GNU test will occasionally fail), and a commit adding documentation on benchmarking join.

The last of these optimizations, guessing how many fields there will be, has a minor worst-case scenario. We guess that the number of fields in a line will be the greatest number of fields we've seen in any line of this file so far. In the typical case of "every line has the same number of fields", this works perfectly; in the unusual case of a single line at the start with far more fields than any other line in the file, it causes some memory overhead. Since each Line is only kept in memory long enough to find all its matches, and the process would need enough memory for the single long Line anyway, this shouldn't be much of an issue.

By abstracting the writer we write to, we can lock stdout once at the
beginning, then use buffered writes to it throughout.
Using indexes into the line instead of Vec<u8>s means we don't have to copy
the line to store the fields (indexes instead of slices because it avoids
self-referential structs). Using memchr also empirically saves a lot of
intermediate allocations.
This lets us use fewer reallocations when parsing each line.
The current guess is set to the maximum number of fields seen in a line so far. This is a free performance win in the common case where every line has the same number of fields, but comes with some memory overhead when a line with many fields appears at the beginning of the file and later lines have fewer; each of these lines is typically not kept for very long anyway.
@sylvestre (Contributor) commented:

Wow, terrific. Is it possible to add some tests to cover the uncovered lines? (mostly error management)

@jtracey (Contributor, Author) commented on Feb 7, 2022

Hmm... I can definitely add a test for a failure on --check-order, that probably should have been there before this PR anyway, but most of the new untested lines are for handling the IOError case. Does the testing framework have a good way to consistently cause reading an open file or writing to stdout to fail? Off the top of my head, the only way I know to cause that in the real world (without some sort of file system failure or kernel bug) is to use network sockets, which seems a bit heavy-handed here.

@jtracey (Contributor, Author) commented on Feb 8, 2022

Okay, I found some other tests that use /dev/full to break stdout, which should hopefully get us most of the way there. That and --check-order are now tested. I still don't know how to create bad reads on command.

I also added a commit to make sure when using --check-order that stdout gets flushed before printing the final error message.

@sylvestre merged commit e818fd2 into uutils:main on Feb 10, 2022