feat(compaction): binary copy capability for compaction #5434
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -495,6 +495,26 @@ impl FileWriter { | |
| self.schema_metadata.insert(key.into(), value.into()); | ||
| } | ||
|
|
||
| /// Prepare the writer when column data and metadata were produced externally. | ||
| /// | ||
| /// This is useful for flows that copy already-encoded pages (e.g., binary copy | ||
| /// during compaction) where the column buffers have been written directly and we | ||
| /// only need to write the footer and schema metadata. The provided | ||
| /// `column_metadata` must describe the buffers already persisted by the | ||
| /// underlying `ObjectWriter`, and `rows_written` should reflect the total number | ||
| /// of rows in those buffers. | ||
| pub fn initialize_with_external_metadata( | ||
| &mut self, | ||
| schema: lance_core::datatypes::Schema, | ||
| column_metadata: Vec<pbfile::ColumnMetadata>, | ||
| rows_written: u64, | ||
| ) { | ||
| self.schema = Some(schema); | ||
| self.num_columns = column_metadata.len() as u32; | ||
| self.column_metadata = column_metadata; | ||
| self.rows_written = rows_written; | ||
| } | ||
|
|
||
| /// Adds a global buffer to the file | ||
| /// | ||
| /// The global buffer can contain any arbitrary bytes. It will be written to the disk | ||
|
|
@@ -595,7 +615,9 @@ impl FileWriter { | |
| .collect::<FuturesOrdered<_>>(); | ||
| self.write_pages(encoding_tasks).await?; | ||
|
|
||
| - self.finish_writers().await?; | ||
| + if !self.column_writers.is_empty() { | ||
| + self.finish_writers().await?; | ||
| + } | ||
|
Comment on lines +618 to +620

Member: Why is this change needed?

Contributor (Author): In binary copy, we write the column data and metadata directly, by copying, without going through `FileWriter`. When flushing the footer, we want to reuse as much existing code as possible, so we construct a stand-in `FileWriter` and call its `finish` method to trigger the footer write. This check is therefore needed to skip `finish_writers` in binary-copy-like scenarios, where `column_writers` is empty.
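To make the flow in this thread concrete, here is a minimal, synchronous sketch of the binary-copy footer path. It is not the code in this PR: `SketchWriter`, `ColumnMetadataSketch`, `FooterSketch`, and the `finish_writers_calls` counter are hypothetical stand-ins for `FileWriter`, `pbfile::ColumnMetadata`, and the real footer, and the schema is reduced to a list of field names. Only the shape of `initialize_with_external_metadata` and the `column_writers.is_empty()` guard are taken from the diff.

```rust
/// Hypothetical stand-in for `pbfile::ColumnMetadata`.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnMetadataSketch {
    pub buffer_offsets: Vec<u64>,
    pub buffer_sizes: Vec<u64>,
}

/// Hypothetical, synchronous stand-in for `FileWriter`.
#[derive(Debug, Default)]
pub struct SketchWriter {
    pub schema: Option<Vec<String>>, // field names stand in for the schema
    pub num_columns: u32,
    pub column_metadata: Vec<ColumnMetadataSketch>,
    pub rows_written: u64,
    pub column_writers: Vec<String>, // names stand in for live column encoders
    pub finish_writers_calls: u32,   // instrumentation for this sketch only
}

/// Hypothetical summary of what the footer would record.
#[derive(Debug, PartialEq)]
pub struct FooterSketch {
    pub num_columns: u32,
    pub rows_written: u64,
}

impl SketchWriter {
    /// Mirrors the method added in the diff: adopt metadata for buffers
    /// that were already written by some other path (e.g. binary copy).
    pub fn initialize_with_external_metadata(
        &mut self,
        schema: Vec<String>,
        column_metadata: Vec<ColumnMetadataSketch>,
        rows_written: u64,
    ) {
        self.schema = Some(schema);
        self.num_columns = column_metadata.len() as u32;
        self.column_metadata = column_metadata;
        self.rows_written = rows_written;
    }

    /// Placeholder for flushing live column encoders.
    fn finish_writers(&mut self) {
        self.finish_writers_calls += 1;
    }

    /// Writes the "footer". Column encoders are flushed only if any exist,
    /// which is the guard under review.
    pub fn finish(&mut self) -> FooterSketch {
        if !self.column_writers.is_empty() {
            self.finish_writers();
        }
        FooterSketch {
            num_columns: self.num_columns,
            rows_written: self.rows_written,
        }
    }
}
```

In the binary-copy case, a caller would build a default `SketchWriter`, call `initialize_with_external_metadata` with the copied columns' metadata and row count, and then call `finish`. Because `column_writers` is empty, `finish_writers` is never invoked and the footer reflects only the externally supplied metadata.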
||
|
|
||
| // 3. write global buffers (we write the schema here) | ||
| let global_buffer_offsets = self.write_global_buffers().await?; | ||
|
|
||