857 speed up repeated csv reads binary #944
Open. EricBenschneider wants to merge 61 commits into daphne-eu:main from EricBenschneider:857-speed-up-repeated-csv-reads-binary
This reverts commit 85ea77a.
This reverts commit 8d79817.
Labels
LDE winter 2024/25
Student project in the course Large-scale Data Engineering at TU Berlin (winter 2024/25).
Description:
This PR refactors CSV parsing to support daphne's binary data format (.dbdf), which saves DenseMatrix and Frame objects in a streamlined manner. In our experiments, the new binary saving mechanism clearly speeds up repeated reads of the same file and operates reliably. Key changes include:
Efficient Binary Saving and Loading:
When the use-dbdf-optimization flag is set, or save_csv_as_bin is set to true in the user config, the reader first attempts to load a previously generated .dbdf file using readDaphne().
If the .dbdf file is found and valid, the data is loaded directly from the .dbdf file.
If the .dbdf file is not found or an error occurs during its load, the standard CSV parsing path is executed, and a .dbdf file is generated afterward using writeDaphne().
This dual-path strategy ensures that subsequent runs benefit from fast native binary I/O.
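The dual-path strategy above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in this PR: `Matrix`, `loadBinary`, `saveBinary`, `parseCsv`, and `readCsvCached` are hypothetical stand-ins for DenseMatrix, readDaphne(), writeDaphne(), and the CSV reader. The toy binary layout is not the real .dbdf format, and naming the cache file `<csv path>.dbdf` is an assumption.

```cpp
#include <cstdint>
#include <fstream>
#include <optional>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a dense matrix; not DAPHNE's DenseMatrix.
struct Matrix {
    std::uint64_t rows = 0, cols = 0;
    std::vector<double> values; // row-major
};

// Toy binary layout (rows, cols, raw doubles); NOT the real .dbdf format.
// Returns std::nullopt if the file is missing or truncated.
std::optional<Matrix> loadBinary(const std::string &path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return std::nullopt;
    Matrix m;
    in.read(reinterpret_cast<char *>(&m.rows), sizeof m.rows);
    in.read(reinterpret_cast<char *>(&m.cols), sizeof m.cols);
    if (!in) return std::nullopt;
    m.values.resize(m.rows * m.cols);
    in.read(reinterpret_cast<char *>(m.values.data()),
            static_cast<std::streamsize>(m.values.size() * sizeof(double)));
    if (!in) return std::nullopt;
    return m;
}

// Writes the toy layout; returns false if writing failed.
bool saveBinary(const std::string &path, const Matrix &m) {
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char *>(&m.rows), sizeof m.rows);
    out.write(reinterpret_cast<const char *>(&m.cols), sizeof m.cols);
    out.write(reinterpret_cast<const char *>(m.values.data()),
              static_cast<std::streamsize>(m.values.size() * sizeof(double)));
    out.flush();
    return out.good();
}

// Very small CSV parser: numeric cells only, no quoting, no header.
Matrix parseCsv(const std::string &path) {
    std::ifstream in(path);
    if (!in) throw std::runtime_error("cannot open " + path);
    Matrix m;
    std::string line, cell;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        std::istringstream cells(line);
        std::uint64_t n = 0;
        while (std::getline(cells, cell, ',')) {
            m.values.push_back(std::stod(cell));
            ++n;
        }
        m.cols = n;
        ++m.rows;
    }
    return m;
}

// Fast path: load the binary copy. Slow path: parse the CSV, then write
// the binary copy so the next read can take the fast path.
Matrix readCsvCached(const std::string &csvPath, bool useBinaryCache) {
    const std::string binPath = csvPath + ".dbdf"; // assumed naming scheme
    if (useBinaryCache) {
        if (auto cached = loadBinary(binPath)) return *cached;
    }
    Matrix m = parseCsv(csvPath);
    if (useBinaryCache) saveBinary(binPath, m); // best effort, errors ignored
    return m;
}
```

Calling `readCsvCached("data.csv", true)` twice takes the slow path on the first call and the fast path on the second. The sketch does not check whether the CSV file changed after the binary copy was written, and it trusts the header of the binary file.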
Performance Gains:
Experiments show that loading from the .dbdf file significantly reduces read times on repeated reads.

The following chart lists the file size ratio relative to the original CSV file. REP is a matrix of repeating signed integers.
The CSV files used for these results were generated. The resulting frame contains an even mix of all numeric value types currently supported by daphne.

The CSV files used for these results were generated. The resulting matrix consists of random floating-point values.

The results show that the first read is noticeably slower than a normal CSV read, since it also writes the .dbdf file. However, the speedup on subsequent reads seems to justify the feature.
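To make this trade-off concrete, the number of reads after which the binary copy pays off can be estimated from three timings. The sketch below is illustrative only: `breakEvenReads` and its parameters are hypothetical names, and the numbers in the usage note are made up, not measurements from this PR. It returns the smallest n for which `tFirst + (n - 1) * tBin < n * tCsv`.

```cpp
#include <cmath>
#include <optional>

// tCsv:   time of one plain CSV read
// tFirst: time of the first read with the optimization (parse + write .dbdf)
// tBin:   time of one read from the binary copy
// Returns the smallest number of reads n with
//   tFirst + (n - 1) * tBin < n * tCsv,
// or std::nullopt if the cached variant never becomes faster overall.
std::optional<long> breakEvenReads(double tCsv, double tFirst, double tBin) {
    if (tFirst < tCsv) return 1;           // already ahead on the first read
    if (tBin >= tCsv) return std::nullopt; // binary reads never catch up
    // Rearranged inequality: n > (tFirst - tBin) / (tCsv - tBin)
    const double x = (tFirst - tBin) / (tCsv - tBin);
    return static_cast<long>(std::floor(x)) + 1;
}
```

For example, with a plain read of 10 time units, a first cached read of 14, and a binary read of 2, the function yields 2: the second read already amortizes the extra cost of writing the binary file.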
Supported Data Structures:
The mechanism works for both DenseMatrix (with numeric value types) and Frame objects that do not contain strings. This limitation exists because daphne's binary data format does not currently support strings. If that changes, this feature could be extended to support string values as well.
Testing:
Unit Tests:
The existing test suite (e.g., in ReadCsvTest.cpp) now verifies that:
System-Level Tests:
System-level tests have been executed by running full .daphne files that use the readFrame and readMatrix functions. These tests confirm that:
Overall, the new .dbdf saving workflow is working as expected and provides a significant performance boost.
Please review and test the changes. Feedback is welcome!