Skip to content

feat: break stream by max bytes param#3435

Closed
SaintBacchus wants to merge 2 commits intolance-format:mainfrom
SaintBacchus:BreakStream
Closed

feat: break stream by max bytes param#3435
SaintBacchus wants to merge 2 commits intolance-format:mainfrom
SaintBacchus:BreakStream

Conversation

@SaintBacchus
Copy link
Copy Markdown
Collaborator

try to fix #3393

@github-actions github-actions Bot added the enhancement New feature or request label Feb 7, 2025
}

if num_rows_in_current_file >= params.max_rows_per_file as u32
|| writer.as_mut().unwrap().tell().await? >= params.max_bytes_per_file as u64
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tell() always returns 0 since the writer has not called the finish()

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Feb 24, 2025

Codecov Report

Attention: Patch coverage is 92.30769% with 2 lines in your changes missing coverage. Please review.

Project coverage is 78.46%. Comparing base (356d132) to head (0ef5b56).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
rust/lance-datafusion/src/chunker.rs 92.30% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3435      +/-   ##
==========================================
- Coverage   78.46%   78.46%   -0.01%     
==========================================
  Files         252      252              
  Lines       93917    94032     +115     
  Branches    93917    94032     +115     
==========================================
+ Hits        73696    73778      +82     
- Misses      17226    17262      +36     
+ Partials     2995     2992       -3     
Flag Coverage Δ
unittests 78.46% <92.30%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@SaintBacchus SaintBacchus force-pushed the BreakStream branch 2 times, most recently from d94d166 to ad8d375 Compare February 27, 2025 03:36
format

fix some ut

fix some ut

1G to 90G

ignore one test case

try fix something

format

refactor

1

2

3

4

5
@SaintBacchus
Copy link
Copy Markdown
Collaborator Author

batch.get_array_memory_size() will return all the batch memory sizes, regardless of whether the batch was sliced.

If we use the average size to estimate the memory usage, it's not accurate.

@wjones127
Copy link
Copy Markdown
Contributor

batch.get_array_memory_size() will return all the batch memory sizes, regardless of whether the batch was sliced.

If we use the average size to estimate the memory usage, it's not accurate.

Did you see this https://docs.rs/arrow/latest/arrow/array/struct.ArrayData.html#method.get_slice_memory_size ?

@SaintBacchus
Copy link
Copy Markdown
Collaborator Author

I had missed this method. I will test it later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Is there a good way to make max_bytes_per_file used in Stable data storage version

3 participants