857 speed up repeated csv reads binary #944

EricBenschneider · 2025-02-16T18:07:30Z

Description:

This PR refactors CSV parsing to support DAPHNE's binary data format (.dbdf), which stores DenseMatrix and Frame objects in a compact form. The new binary (DAPHNE) saving mechanism speeds up repeated reads of the same CSV file and ran reliably in the tests described below. Key changes include:

Efficient Binary Saving and Loading:

When the use-dbdf-optimization flag is set, or save_csv_as_bin is set to true in the user config, the reader first attempts to load a preformatted .dbdf file using readDaphne().
If the .dbdf file is found and valid, the data is loaded directly from the .dbdf file.
If the .dbdf file is not found or an error occurs during its load, the standard CSV parsing path is executed, and a .dbdf file is generated afterward using writeDaphne().
This dual-path strategy ensures that subsequent runs benefit from fast native binary I/O.
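
The dual-path strategy described above can be sketched as follows. This is a minimal, self-contained illustration and not the code from this PR: `Matrix`, `parseCsv`, `writeBinary`, `readBinary` and `readCached` are hypothetical stand-ins for DenseMatrix, the CSV reader, writeDaphne() and readDaphne(), the toy binary layout is not the real .dbdf layout, and naming the cache file by appending `.dbdf` to the CSV path is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <sstream>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical stand-in for a dense matrix; not DAPHNE's DenseMatrix.
struct Matrix {
    std::uint64_t rows = 0, cols = 0;
    std::vector<double> values; // row-major
};

// Slow path: parse comma-separated doubles, one row per line.
Matrix parseCsv(const std::string &csvPath) {
    Matrix m;
    std::ifstream in(csvPath);
    std::string line, cell;
    while (std::getline(in, line)) {
        if (line.empty())
            continue;
        std::stringstream ss(line);
        std::uint64_t n = 0;
        while (std::getline(ss, cell, ',')) {
            m.values.push_back(std::stod(cell));
            ++n;
        }
        m.cols = n;
        ++m.rows;
    }
    return m;
}

// Toy binary layout (rows, cols, raw doubles); NOT the real .dbdf layout.
bool writeBinary(const Matrix &m, const std::string &binPath) {
    assert(m.values.size() == m.rows * m.cols); // rectangular input expected
    std::ofstream out(binPath, std::ios::binary);
    if (!out)
        return false;
    out.write(reinterpret_cast<const char *>(&m.rows), sizeof m.rows);
    out.write(reinterpret_cast<const char *>(&m.cols), sizeof m.cols);
    out.write(reinterpret_cast<const char *>(m.values.data()),
              static_cast<std::streamsize>(m.values.size() * sizeof(double)));
    return static_cast<bool>(out);
}

// Fast path: returns false if the file is missing or its size does not
// match the header, so the caller can fall back to CSV parsing.
bool readBinary(const std::string &binPath, Matrix &m) {
    std::ifstream in(binPath, std::ios::binary);
    if (!in)
        return false;
    in.read(reinterpret_cast<char *>(&m.rows), sizeof m.rows);
    in.read(reinterpret_cast<char *>(&m.cols), sizeof m.cols);
    if (!in)
        return false;
    std::error_code ec;
    const auto size = std::filesystem::file_size(binPath, ec);
    const std::uint64_t count = m.rows * m.cols;
    if (ec || size != 2 * sizeof(std::uint64_t) + count * sizeof(double))
        return false;
    m.values.resize(count);
    in.read(reinterpret_cast<char *>(m.values.data()),
            static_cast<std::streamsize>(count * sizeof(double)));
    return static_cast<bool>(in);
}

// Dual path: try the binary cache first, otherwise parse the CSV and
// write the cache so that the next read can take the fast path.
Matrix readCached(const std::string &csvPath, bool useBinaryCache,
                  bool *fromCache = nullptr) {
    const std::string binPath = csvPath + ".dbdf"; // assumed naming scheme
    Matrix m;
    if (useBinaryCache && readBinary(binPath, m)) {
        if (fromCache)
            *fromCache = true;
        return m;
    }
    if (fromCache)
        *fromCache = false;
    m = parseCsv(csvPath);
    if (useBinaryCache)
        writeBinary(m, binPath); // best effort; a failed write only costs speed
    return m;
}
```

Calling `readCached(path, true)` twice on the same file takes the CSV path on the first call and the binary path on the second; a failed or invalid cache load silently falls back to CSV parsing, mirroring the behavior described above.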

Performance Gains:

Experiments show that the .dbdf saving mechanism significantly reduces read times on repeated reads.
The following chart lists the file size ratio in comparison to the original CSV file. REP is a matrix with repeating signed integers.

The CSV files used for these results were generated. The resulting frame consists of an evenly distributed mix of all numeric value types currently supported by DAPHNE.

The CSV files used for these results were generated. The resulting matrix consists of random floating-point values.

The results show that the first read is noticeably slower than a normal CSV read, because the .dbdf file has to be written as well. However, the speedup on repeated reads seems to justify the feature.
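
The trade-off between the slower first read and the faster subsequent reads can be made explicit with a simple cost model. The symbols are assumptions introduced here for illustration and do not come from the measurements: t_csv is the time of one plain CSV read, t_write the extra time to write the .dbdf file on the first read, t_bin the time of one binary read (with t_bin < t_csv), and n the number of reads of the same file.

```latex
\begin{align*}
T_{\text{plain}}(n)  &= n \, t_{\text{csv}} \\
T_{\text{cached}}(n) &= (t_{\text{csv}} + t_{\text{write}}) + (n - 1) \, t_{\text{bin}} \\
T_{\text{cached}}(n) < T_{\text{plain}}(n)
  &\iff t_{\text{write}} < (n - 1)\,(t_{\text{csv}} - t_{\text{bin}}) \\
  &\iff n > 1 + \frac{t_{\text{write}}}{t_{\text{csv}} - t_{\text{bin}}}
\end{align*}
```

Under this model the feature pays off as soon as the file is read more often than the bound on the right-hand side; the smaller the write overhead relative to the per-read saving, the sooner that happens.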

Supported Data Structures:

The mechanism works for DenseMatrix (with numeric value types) and for Frame objects that do not contain strings. This limitation exists because DAPHNE's binary data format currently does not support strings. If that changes in the future, this feature could be extended to support string values as well.
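
The restriction to string-free frames amounts to a schema check before the binary path is used. The following is a minimal sketch under stated assumptions: the `ValueType` enum and the function `isBinaryCacheable` are hypothetical names chosen for this example and are not taken from DAPHNE's code base.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical column value types; a simplified stand-in, not DAPHNE's enum.
enum class ValueType { SI8, SI32, SI64, UI8, UI32, UI64, F32, F64, STR };

// A frame schema qualifies for the binary path only if no column is a string.
bool isBinaryCacheable(const std::vector<ValueType> &schema) {
    return std::none_of(schema.begin(), schema.end(),
                        [](ValueType t) { return t == ValueType::STR; });
}

// Usage: purely numeric schemas qualify, schemas with a string column do not.
void exampleUsage() {
    assert(isBinaryCacheable({ValueType::SI64, ValueType::F64}));
    assert(!isBinaryCacheable({ValueType::SI64, ValueType::STR}));
}
```

A frame that fails this check would simply continue to use the standard CSV parsing path on every read.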

Testing:

Unit Tests:
The existing test suite (e.g., in ReadCsvTest.cpp) now verifies that:

When the feature is enabled, a .dbdf file is created on the first read.
Subsequent reads load data from the .dbdf file and yield identical results to the standard CSV parsing path.
Both DenseMatrix and Frame types function correctly with the new binary saving.

System-Level Tests:
System-level tests have been executed by running full .daphne files that use the readFrame and readMatrix functions. These tests confirm that:

The binary saving mechanism consistently produces the expected .dbdf file.
Overall performance improvements are observed without sacrificing correctness.
Data loaded from the binary file is identical to that loaded via CSV parsing.
The binary file is generated reliably when none exists and used subsequently for accelerated reads.

Overall, the new .dbdf saving workflow works as expected and provides a significant performance boost on repeated reads.

Please review and test the changes. Feedback is welcome!

This reverts commit 85ea77a.

This reverts commit 8d79817.

EricBenschneider mentioned this pull request Feb 16, 2025

857 speed up repeated csv reads using positional map #945

Open

EricBenschneider added 6 commits February 17, 2025 21:36

added generateFileMetaData

b3d829e

added tests for meta data generation

30edf69

updated read kernel and readMetaData for meta data generation

b9f5913

used matrix/frame flag for meta data generation

312b30c

ran clang-format

030aa48

fixed runtime error when trying to save generated file

37f7d68

EricBenschneider force-pushed the 857-speed-up-repeated-csv-reads-binary branch from d336e5a to 370822d Compare February 17, 2025 22:20

EricBenschneider added 22 commits February 24, 2025 03:57

using positional map for frame reading

b7be227

added positional map utility functions

dcd653e

posMap working but indexes screwed

2f82483

new tests

697e105

update tests to not use newline

52a7d2b

wsl stuff

b0f011c

refactor old readcsvfile for frames

8e01228

added daphne file util to csv

e43ea36

conv to unix file endings

4febfe8

added config for read optimizations

9468475

fixed flag usage

e258539

added config for read optimization

68abc97

metadata test fix

e4979a2

added generateFileMetaData

4e72247

added tests for meta data generation

e7c0751

updated read kernel and readMetaData for meta data generation

9bdbaf3

updated DaphneDSL to use label flag

6e84c9e

improved generateMetaDataTest

ad8650f

added systest for reading frame without meta data

6b724e8

Revert "updated DaphneDSL to use label flag"

40c3ffe

This reverts commit 85ea77a.

removed label flag

d3cbc9e

improved generateMetaDataTest

29973e1

EricBenschneider added 26 commits February 24, 2025 03:57

finished bin files and added tests

e918b3a

added support for dense matrix

56dfd15

added support for csr matrix

ca7f2e4

changes to matrix optimization

a70701d

added readopt commandline flag

549dbf3

used dbdf file ending

51455f6

finished frames opt

29e7058

added evaluation artifacts

7f4785a

positional map overhaul

b131054

Revert "positional map overhaul"

8b8cce8

This reverts commit 8d79817.

positional map update

fd0f031

removed positional map

ae8f4ac

removed line prints

8c7d91d

updated eval

7961621

removed posmap

65b33cb

eval code

fbcc087

automated evaluation result saving

03590e9

added systests for reads using optimization

3abd2ba

fixed rebase errors

3999735

commented prints

2c0b679

fixed tests and rebase errors

bfdbfde

added experiment script

ffa39bc

ran first experiments and created charts

7f6e37e

used single flag for optimizations

1beb863

added prints for evaluation

4821bcf

finished evaluation

953a5e4

EricBenschneider force-pushed the 857-speed-up-repeated-csv-reads-binary branch from 370822d to 953a5e4 Compare February 24, 2025 03:28

EricBenschneider marked this pull request as ready for review February 24, 2025 03:54

pdamme self-requested a review March 24, 2025 18:55

pdamme added the LDE winter 2024/25 label (Student project in the course Large-scale Data Engineering at TU Berlin, winter 2024/25) Mar 24, 2025
