Faster parsing: reduce String usage, list-based input rows. #15681
Merged
asdf2014 merged 8 commits into apache:master on Jan 18, 2024
The only failure is |
asdf2014
approved these changes
Jan 18, 2024
Three changes:
1) Reworked FastLineIterator to optionally avoid generating Strings entirely, and reduce copying somewhat. Benefits the line-oriented JSON, CSV, delimited (TSV), and regex formats.
2) In the delimited (TSV) format, when the delimiter is a single byte, split on UTF-8 bytes directly.
3) In CSV and delimited (TSV) formats, use list-based input rows when the column list is provided upfront by the user.
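To illustrate the idea behind change (2), here is a minimal sketch of splitting a line on a single-byte delimiter without first decoding the whole line to a String. It relies on a property of UTF-8: bytes below 0x80 never occur inside a multi-byte sequence, so scanning for an ASCII delimiter byte cannot cut a character in half. The class and method names (`ByteSplitSketch`, `splitOnByte`) are hypothetical and are not taken from this PR; the real implementation may differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the code from this PR.
final class ByteSplitSketch {
    // Splits the first `length` bytes of `line` (UTF-8, no trailing newline)
    // on a single-byte ASCII delimiter. Only the individual fields are decoded.
    static List<String> splitOnByte(byte[] line, int length, byte delimiter) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < length; i++) {
            if (line[i] == delimiter) {
                fields.add(new String(line, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        // The last field runs from the final delimiter to the end of the line.
        fields.add(new String(line, start, length - start, StandardCharsets.UTF_8));
        return fields;
    }

    public static void main(String[] args) {
        // "\u00fc" is a two-byte character in UTF-8; the empty third field is preserved.
        byte[] line = "a\t\u00fcber\t\t42".getBytes(StandardCharsets.UTF_8);
        System.out.println(splitOnByte(line, line.length, (byte) '\t'));
    }
}
```

The sketch still creates one String per field; it only avoids materializing the full line as a String before splitting.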
Benchmarks below. Findings:
- JsonLineReaderBenchmark only benefits from change (1), and got a 15% improvement.
- DelimitedInputFormatBenchmark with fromHeader: true benefits from (1) and (2), and got a 22% improvement.
- DelimitedInputFormatBenchmark with fromHeader: false benefits from all three changes, and got a 30% improvement.
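One plausible reading of why the fromHeader: false case also benefits from change (3): when the column list is known upfront, a name-to-position index can be built once and shared, so each row only needs to hold a list of values rather than its own map. The sketch below illustrates that general technique; `ListRowSketch`, `indexColumns`, and `Row` are hypothetical names and do not reflect the classes in this PR.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of list-based rows, not the code from this PR.
final class ListRowSketch {
    // Built once from the user-provided column list and shared by all rows.
    static Map<String, Integer> indexColumns(List<String> columns) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < columns.size(); i++) {
            index.put(columns.get(i), i);
        }
        return index;
    }

    // A row stores only its values; column names live in the shared index.
    static final class Row {
        private final Map<String, Integer> index;
        private final List<String> values;

        Row(Map<String, Integer> index, List<String> values) {
            this.index = index;
            this.values = values;
        }

        // Returns null for unknown columns or for rows shorter than the column list.
        String get(String column) {
            Integer i = index.get(column);
            return (i == null || i >= values.size()) ? null : values.get(i);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> index = indexColumns(Arrays.asList("ts", "page", "added"));
        Row row = new Row(index, Arrays.asList("2024-01-18", "Main_Page", "17"));
        System.out.println(row.get("page"));
    }
}
```

Under this design, the per-row cost is one list allocation instead of one map with an entry per column.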