
join: improve performance #3092

Merged: sylvestre merged 7 commits into uutils:main from jtracey:join-performance on Feb 10, 2022
Conversation

@jtracey (Contributor) commented on Feb 7, 2022

This adds a series of performance improvements to join:

  • Lock and buffer stdout, avoiding repeated lock acquisition in our single-threaded use case and the overhead of many tiny writes, such as single newlines.
  • Use faster techniques for finding and keeping track of fields.
  • Make an intelligent guess of how many fields a line will have, rather than repeatedly growing the field vector for every single line.

See the following benchmarks (each of the join executables has the cumulative improvements up to that commit, not just the respective one):

$ hyperfine -w 5 \
 "~/Documents/coreutils/join.main -i -t, -j14 kitch_bench.csv wr_bench.csv" \
 "~/Documents/coreutils/join.buffered -i -t, -j14 kitch_bench.csv wr_bench.csv" \
 "~/Documents/coreutils/join.field_parsing -i -t, -j14 kitch_bench.csv wr_bench.csv" \
 "~/Documents/coreutils/join.field_guess -i -t, -j14 kitch_bench.csv wr_bench.csv" \
 "join -i -t, -j14 kitch_bench.csv wr_bench.csv"
Benchmark 1: ~/Documents/coreutils/join.main -i -t, -j14 kitch_bench.csv wr_bench.csv
  Time (mean ± σ):     701.5 ms ±   8.5 ms    [User: 658.5 ms, System: 42.9 ms]
  Range (min … max):   690.4 ms … 716.4 ms    10 runs
 
Benchmark 2: ~/Documents/coreutils/join.buffered -i -t, -j14 kitch_bench.csv wr_bench.csv
  Time (mean ± σ):     471.3 ms ±  12.7 ms    [User: 444.6 ms, System: 26.6 ms]
  Range (min … max):   457.1 ms … 500.9 ms    10 runs
 
Benchmark 3: ~/Documents/coreutils/join.field_parsing -i -t, -j14 kitch_bench.csv wr_bench.csv
  Time (mean ± σ):     257.2 ms ±   4.4 ms    [User: 227.4 ms, System: 29.7 ms]
  Range (min … max):   252.2 ms … 266.7 ms    11 runs
 
Benchmark 4: ~/Documents/coreutils/join.field_guess -i -t, -j14 kitch_bench.csv wr_bench.csv
  Time (mean ± σ):     194.8 ms ±   4.1 ms    [User: 170.4 ms, System: 24.6 ms]
  Range (min … max):   188.3 ms … 202.9 ms    15 runs
 
Benchmark 5: join -i -t, -j14 kitch_bench.csv wr_bench.csv
  Time (mean ± σ):     240.0 ms ±   5.5 ms    [User: 212.6 ms, System: 27.5 ms]
  Range (min … max):   233.7 ms … 251.4 ms    12 runs
 
Summary
  '~/Documents/coreutils/join.field_guess -i -t, -j14 kitch_bench.csv wr_bench.csv' ran
    1.23 ± 0.04 times faster than 'join -i -t, -j14 kitch_bench.csv wr_bench.csv'
    1.32 ± 0.04 times faster than '~/Documents/coreutils/join.field_parsing -i -t, -j14 kitch_bench.csv wr_bench.csv'
    2.42 ± 0.08 times faster than '~/Documents/coreutils/join.buffered -i -t, -j14 kitch_bench.csv wr_bench.csv'
    3.60 ± 0.09 times faster than '~/Documents/coreutils/join.main -i -t, -j14 kitch_bench.csv wr_bench.csv'

The GNU join tested is from GNU coreutils 8.32, the default package shipped in Ubuntu 21.10. The kitch_bench.csv and wr_bench.csv files are made of public address data, with some minor post-processing (IIRC, just sorting the generated originals, but I could have forgotten another step or two). They are 14 MiB and 38 MiB, respectively, with many, but not all, records in common. As you can see, all told, we're now 3.6x faster than the current main branch, and 1.2x faster than GNU.

Also included is a commit adding some minor improvements to error handling (necessary to allow flushing stdout if we're about to terminate from unordered lines; without this, the GNU test will occasionally fail), and a commit adding documentation on benchmarking join.

The last of these optimizations, guessing how many fields there will be, has a minor worst-case scenario. We guess that the number of fields in a line will be the greatest number of fields we've seen in any line of this file so far. In the typical case of "every line has the same number of fields", this works perfectly; in the unusual case of a single line at the start with far more fields than any other line in the file, it causes some memory overhead. Since each Line is only kept in memory long enough to find all its matches, and the process would need enough memory for the single long Line anyway, this shouldn't be much of an issue.

By abstracting the writer we write to, we can lock stdout once at the
beginning, then use buffered writes to it throughout.
Using indexes into the line instead of Vec<u8>s means we don't have to copy
the line to store the fields (indexes instead of slices because it avoids
self-referential structs). Using memchr also empirically saves a lot of
intermediate allocations.
This lets us use fewer reallocations when parsing each line.
The current guess is set to the maximum number of fields seen in a line so far. This is a free performance win in the common case where every line has the same number of fields, but comes with some memory overhead when a line with many fields appears at the beginning of the file and later lines have fewer; each of these lines is typically not kept for very long anyway.
@sylvestre (Contributor) commented:

Wow, terrific. Is it possible to add some tests to cover the uncovered lines? (mostly error management)

@jtracey (Contributor, Author) commented on Feb 7, 2022

Hmm... I can definitely add a test for a failure on --check-order, that probably should have been there before this PR anyway, but most of the new untested lines are for handling the IOError case. Does the testing framework have a good way to consistently cause reading an open file or writing to stdout to fail? Off the top of my head, the only way I know to cause that in the real world (without some sort of file system failure or kernel bug) is to use network sockets, which seems a bit heavy-handed here.

@jtracey (Contributor, Author) commented on Feb 8, 2022

Okay, I found some other tests that use /dev/full to break stdout, which should hopefully get us most of the way there. That and --check-order are now tested. I still don't know how to create bad reads on command.

I also added a commit to make sure when using --check-order that stdout gets flushed before printing the final error message.

@sylvestre merged commit e818fd2 into uutils:main on Feb 10, 2022