@EricBenschneider EricBenschneider commented Feb 16, 2025

Description:

This PR refactors CSV reading so that parsed DenseMatrix and Frame objects can be cached in DAPHNE's binary data format (.dbdf). On repeated reads of the same CSV file, loading the cached .dbdf file is clearly faster than parsing the CSV again, and it has worked reliably in testing. Key changes include:

Efficient Binary Saving and Loading:

When the use-dbdf-optimization flag is set, or save_csv_as_bin is set to true in the user config, the reader first attempts to load a previously generated .dbdf file using readDaphne().
If the .dbdf file is found and valid, the data is loaded directly from it and the CSV file is not parsed.
If the .dbdf file is not found or an error occurs while loading it, the standard CSV parsing path is executed, and a .dbdf file is generated afterward using writeDaphne().
This dual-path strategy ensures that subsequent runs benefit from fast native binary I/O.
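
The dual-path strategy described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: the `Matrix` struct, `parseCsv`, `writeBinary`, `readBinary`, and `readCsvCached` are hypothetical stand-ins for DenseMatrix, the CSV reader, writeDaphne(), and readDaphne(), and both the simple binary layout and the `<csv path>.dbdf` naming are assumptions made only for this example.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <optional>
#include <sstream>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical stand-in for a dense numeric matrix (not DAPHNE's DenseMatrix).
struct Matrix {
    std::uint64_t rows = 0;
    std::uint64_t cols = 0;
    std::vector<double> values; // row-major
};

// Slow path: parse a purely numeric, comma-separated file.
Matrix parseCsv(const std::string &csvPath) {
    Matrix m;
    std::ifstream in(csvPath);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty())
            continue;
        std::stringstream cells(line);
        std::string cell;
        std::uint64_t cols = 0;
        while (std::getline(cells, cell, ',')) {
            m.values.push_back(std::stod(cell));
            ++cols;
        }
        m.cols = cols;
        ++m.rows;
    }
    return m;
}

// Assumed layout for this sketch only: rows, cols, then the raw values.
void writeBinary(const Matrix &m, const std::string &binPath) {
    std::ofstream out(binPath, std::ios::binary | std::ios::trunc);
    out.write(reinterpret_cast<const char *>(&m.rows), sizeof m.rows);
    out.write(reinterpret_cast<const char *>(&m.cols), sizeof m.cols);
    out.write(reinterpret_cast<const char *>(m.values.data()),
              static_cast<std::streamsize>(m.values.size() * sizeof(double)));
}

// Fast path: returns std::nullopt if the file is missing or inconsistent.
std::optional<Matrix> readBinary(const std::string &binPath) {
    constexpr std::uint64_t header = 2 * sizeof(std::uint64_t);
    std::error_code ec;
    const std::uint64_t size = std::filesystem::file_size(binPath, ec);
    if (ec || size < header || (size - header) % sizeof(double) != 0)
        return std::nullopt;
    std::ifstream in(binPath, std::ios::binary);
    Matrix m;
    in.read(reinterpret_cast<char *>(&m.rows), sizeof m.rows);
    in.read(reinterpret_cast<char *>(&m.cols), sizeof m.cols);
    if (!in)
        return std::nullopt;
    // The header must agree with the payload size before anything is allocated.
    const std::uint64_t count = (size - header) / sizeof(double);
    const bool consistent = (m.rows == 0 || m.cols == 0)
                                ? count == 0
                                : (count % m.cols == 0 && count / m.cols == m.rows);
    if (!consistent)
        return std::nullopt;
    m.values.resize(count);
    in.read(reinterpret_cast<char *>(m.values.data()),
            static_cast<std::streamsize>(count * sizeof(double)));
    if (!in)
        return std::nullopt;
    return m;
}

// Dual path: try the binary file first, otherwise parse the CSV and cache it.
Matrix readCsvCached(const std::string &csvPath, bool useCache) {
    const std::string binPath = csvPath + ".dbdf"; // naming is an assumption
    if (useCache) {
        if (auto cached = readBinary(binPath))
            return *cached;
    }
    Matrix m = parseCsv(csvPath);
    if (useCache)
        writeBinary(m, binPath);
    return m;
}
```

In this sketch, a failed or inconsistent binary load is treated the same as a missing file, so the caller always receives data from one of the two paths.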

Performance Gains:

Experiments show that the .dbdf saving mechanism significantly reduces read times on repeated reads.
The following chart shows the size of the .dbdf file relative to the original CSV file. REP is a matrix of repeating signed integers.
[Figure: avg_ratio_bar_chart]

The CSV files used for these results were synthetically generated. The resulting frame contains an even mix of all numeric value types currently supported by DAPHNE.
[Figure: overall_read_time_frame_number]

The CSV files used for these results were synthetically generated. The resulting matrix consists of random floating-point values.
[Figure: overall_read_time_matrix_float]

In conclusion, the first read with the feature enabled is noticeably slower than a plain CSV read, because the .dbdf file has to be written in addition to parsing the CSV. However, the speedup on repeated reads seems to justify the feature.
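
As a rough way to reason about this trade-off, the break-even point can be written down with three symbols that are introduced here only for illustration and are not measured values from this PR: $t_{csv}$ is the time of a plain CSV read, $t_{first}$ is the time of the first read with the feature enabled (CSV parsing plus writing the .dbdf file), and $t_{bin}$ is the time of a later read from the .dbdf file. Assuming $t_{bin} < t_{csv}$, the feature pays off over $n$ reads of the same file when:

$$
t_{first} + (n - 1)\, t_{bin} < n\, t_{csv}
\quad\Longleftrightarrow\quad
n > \frac{t_{first} - t_{bin}}{t_{csv} - t_{bin}}
$$

The more expensive the first read is relative to a plain CSV read, the more repeated reads are needed before the cached file wins.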

Supported Data Structures:

The mechanism works for both DenseMatrix (with numeric value types) and Frame objects that do not contain strings. This limitation exists because DAPHNE's binary data format does not currently support strings. If that changes in the future, this feature could be extended to support string values as well.
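
One way to express this restriction is a schema check before the binary path is used. The sketch below is an assumption about how such a guard could look, not the check implemented in this PR: `ColType` and `canUseBinaryCache` are hypothetical names, and the listed value types are only examples.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical column types for illustration; not DAPHNE's value type enum.
enum class ColType { Int64, UInt8, Float64, String };

// A frame qualifies for the binary path only if no column holds strings.
bool canUseBinaryCache(const std::vector<ColType> &schema) {
    return std::none_of(schema.begin(), schema.end(),
                        [](ColType t) { return t == ColType::String; });
}
```

With such a guard, a frame that contains a string column would simply keep using the standard CSV parsing path.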

Testing:

Unit Tests:
The existing test suite (e.g., in ReadCsvTest.cpp) now verifies that:

  • When the feature is enabled, a .dbdf file is created on the first read.
  • Subsequent reads load data from the .dbdf file and yield identical results to the standard CSV parsing path.
  • Both DenseMatrix and Frame types function correctly with the new binary saving.

System-Level Tests:
System-level tests have been executed by running full .daphne files that use the readFrame and readMatrix functions. These tests confirm that:

  • The binary saving mechanism consistently produces the expected .dbdf file.
  • Overall performance improvements are observed without sacrificing correctness.
  • Data loaded from the binary file is identical to that loaded via CSV parsing.
  • The binary file is generated reliably when none exists and used subsequently for accelerated reads.

Overall, the new .dbdf saving workflow is working as expected and provides a significant performance boost.

Please review and test the changes. Feedback is welcome!

@EricBenschneider EricBenschneider force-pushed the 857-speed-up-repeated-csv-reads-binary branch from d336e5a to 370822d Compare February 17, 2025 22:20
@EricBenschneider EricBenschneider force-pushed the 857-speed-up-repeated-csv-reads-binary branch from 370822d to 953a5e4 Compare February 24, 2025 03:28
@EricBenschneider EricBenschneider marked this pull request as ready for review February 24, 2025 03:54
@pdamme pdamme self-requested a review March 24, 2025 18:55
@pdamme pdamme added the LDE winter 2024/25 Student project in the course Large-scale Data Engineering at TU Berlin (winter 2024/25). label Mar 24, 2025