By abstracting the writer we write to, we can lock stdout once at the beginning, then use buffered writes to it throughout.
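As a rough sketch of that idea (not the PR's actual code; the `write_records` name is made up here), writing through a generic `Write` bound lets the caller hand in a stdout handle that was locked once and wrapped in a `BufWriter`:

```rust
use std::io::{self, BufWriter, Write};

// Writing through a generic `W: Write` lets callers pass a locked,
// buffered stdout handle -- or anything else, e.g. a Vec<u8> in tests.
fn write_records<W: Write>(out: &mut W, records: &[&str]) -> io::Result<()> {
    for rec in records {
        out.write_all(rec.as_bytes())?;
        out.write_all(b"\n")?;
    }
    out.flush()
}

fn main() -> io::Result<()> {
    let stdout = io::stdout();
    // Lock once up front; BufWriter batches the many small writes
    // so each record doesn't cost a separate syscall.
    let mut out = BufWriter::new(stdout.lock());
    write_records(&mut out, &["field1\tfield2", "a\tb"])
}
```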
Using indexes into the line instead of `Vec<u8>`s means we don't have to copy the line to store the fields (indexes instead of slices, because slices would require a self-referential struct). Using `memchr` also empirically saves a lot of intermediate allocations.
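A minimal sketch of the index approach (hypothetical names, not the PR's code; the standard library's byte search stands in for `memchr::memchr`, which has the same semantics, to keep the example dependency-free):

```rust
// Each field is stored as a (start, end) byte range into the raw line
// buffer, rather than being copied into its own Vec<u8>.
struct Line {
    buf: Vec<u8>,
    // Index pairs avoid the self-referential struct that storing
    // `&buf[s..e]` slices alongside `buf` would require.
    fields: Vec<(usize, usize)>,
}

impl Line {
    fn parse(buf: Vec<u8>, sep: u8) -> Self {
        let mut fields = Vec::new();
        let mut start = 0;
        // memchr::memchr(sep, &buf[start..]) would be used here.
        while let Some(off) = buf[start..].iter().position(|&b| b == sep) {
            fields.push((start, start + off));
            start += off + 1;
        }
        fields.push((start, buf.len()));
        Line { buf, fields }
    }

    // Resolve a field lazily, borrowing from the single line buffer.
    fn field(&self, i: usize) -> &[u8] {
        let (s, e) = self.fields[i];
        &self.buf[s..e]
    }
}

fn main() {
    let line = Line::parse(b"a,bb,ccc".to_vec(), b',');
    println!("{} fields", line.fields.len());
}
```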
This lets us use fewer reallocations when parsing each line. The current guess is set to the maximum fields in a line so far. This is a free performance win in the common case where each line has the same number of fields, but comes with some memory overhead in the case where there is a line with lots of fields at the beginning of the file, and fewer later, but each of these lines are typically not kept for very long anyway.
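The guessing scheme could look something like this (an illustrative sketch under assumed names, not the PR's implementation): carry the largest field count seen so far and pre-size the next line's field vector with it.

```rust
// Pre-size the field-index Vec with the largest field count seen so far.
// In the common case where every line has the same number of fields,
// each line parses with one allocation and no mid-parse reallocation.
fn split_fields(line: &[u8], sep: u8, guess: &mut usize) -> Vec<(usize, usize)> {
    let mut fields = Vec::with_capacity(*guess);
    let mut start = 0;
    while let Some(off) = line[start..].iter().position(|&b| b == sep) {
        fields.push((start, start + off));
        start += off + 1;
    }
    fields.push((start, line.len()));
    // Grow the guess if this line had more fields than any before it.
    *guess = (*guess).max(fields.len());
    fields
}

fn main() {
    let mut guess = 1;
    for line in [&b"a,b,c"[..], &b"x,y,z"[..]] {
        let fields = split_fields(line, b',', &mut guess);
        println!("{} fields (guess now {})", fields.len(), guess);
    }
}
```

The overhead described above shows up when the first line has many fields: every later `Vec` is allocated with that large capacity even if it only holds a few entries.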
Wahou, terrific. Is it possible to add some tests to cover the uncovered lines? (mostly error mgmt)
Hmm... I can definitely add a test for a failure on
Okay, I found some other tests that use. I also added a commit to make sure when using.
This adds a series of performance improvements to join:
See the following benchmarks (each of the join executables has the cumulative improvements up to that commit, not just the respective one):
The GNU join tested is from GNU coreutils 8.32, the default package shipped in Ubuntu 21.10. The `kitch_bench.csv` and `wr_bench.csv` files are made of public address data, with some minor post-processing (IIRC, just sorting the generated originals, but I could have forgotten another step or two). They are 14 MiB and 38 MiB, respectively, with many (but not all) records in common. As you can see, all told, we're now 3.6x faster than the current main branch, and 1.2x faster than GNU.

Also included is a commit adding some minor improvements to error handling (necessary to allow flushing stdout if we're about to terminate on unordered lines; without this, the GNU test will occasionally fail), and a commit adding documentation on benchmarking join.
The last of these optimizations, guessing how many fields there will be, has a minor worst-case scenario: we guess that the number of fields in a line will be the greatest number of fields we've seen in any line so far in this file. In the typical case where every line has the same number of fields this works perfectly, but in the unusual case of a single line at the start with far more fields than any other line in the file, it causes some memory overhead. Since each `Line` is only kept in memory long enough to find all its matches, and the process would need enough memory for the single long `Line` anyway, this shouldn't be much of an issue.