
Feature/compact blobs #5199

Closed
zszheng wants to merge 3 commits into lance-format:main from zszheng:feature/compact_blobs

Conversation


@zszheng zszheng commented Nov 9, 2025

Issue

#5115

Improvement Points

The current implementation does not yet reuse fragment processing across multiple blob columns; reusing it reduces I/O operations and improves compaction performance.
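The reuse described above can be sketched as grouping the (fragment, blob column) pairs by fragment, so each fragment's data files are opened once instead of once per blob column. The names below (`FragmentId`, `group_by_fragment`) are hypothetical stand-ins, not Lance's actual types:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for Lance's fragment identifier.
type FragmentId = u32;

/// Group (fragment, blob column) pairs by fragment so each fragment
/// is visited once, amortizing the per-fragment I/O across all blob columns.
fn group_by_fragment(pairs: &[(FragmentId, String)]) -> HashMap<FragmentId, Vec<String>> {
    let mut groups: HashMap<FragmentId, Vec<String>> = HashMap::new();
    for (frag, col) in pairs {
        groups.entry(*frag).or_default().push(col.clone());
    }
    groups
}

fn main() {
    let pairs = vec![
        (0, "blob_a".to_string()),
        (0, "blob_b".to_string()),
        (1, "blob_a".to_string()),
    ];
    let groups = group_by_fragment(&pairs);
    // Fragment 0 is processed once, covering both of its blob columns.
    assert_eq!(groups[&0].len(), 2);
    assert_eq!(groups[&1], vec!["blob_a".to_string()]);
}
```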

Verification

  1. Write a batch of data to generate multiple *.lance files
  2. Trigger compact_files to merge the data files
  3. Read back the data and check that the contents match the expected results
  4. Run cleanup, verify that the *.lance files were merged, and query the blob data

@github-actions github-actions Bot added the python label Nov 9, 2025
Contributor

github-actions Bot commented Nov 9, 2025

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +139 to +153
let input_schema = inner.schema();
let blob_names: HashSet<&String> = blob_field_name_to_id.iter().map(|(n, _)| n).collect();

for (name, _) in &blob_field_name_to_id {
    if input_schema.column_with_name(name).is_none() {
        panic!("Input schema missing blob field: {}", name);
    }
}

let fields: Vec<Field> = input_schema
    .fields()
    .iter()
    .map(|f| {
        if blob_names.contains(f.name()) {
            Field::new(f.name(), DataType::LargeBinary, f.is_nullable())


P0: Fix blob name lookup compilation error

The new ResolvedBlobStream::new creates a HashSet<&String> and later calls blob_names.contains(f.name()) where f.name() yields &str. HashSet::contains requires the same borrowed type as its keys, but &String does not implement Borrow<str>, so the call does not compile. As written this module will fail to build. Store owned Strings in the set (e.g. HashSet<String>) or convert the lookup to compare against a String.

Useful? React with 👍 / 👎.
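One minimal way to make the lookup compile, per the review above, is to key the set on `&str` rather than `&String`, since `str: Borrow<str>` holds trivially. A sketch with hypothetical data (an owned `HashSet<String>` works equally well):

```rust
use std::collections::HashSet;

/// Build a set of blob column names keyed on `&str`, so that lookups
/// with a `&str` (like `f.name()` in the diff) compile.
fn blob_name_set(pairs: &[(String, u32)]) -> HashSet<&str> {
    pairs.iter().map(|(n, _)| n.as_str()).collect()
}

fn main() {
    // Hypothetical stand-in for blob_field_name_to_id.
    let blob_field_name_to_id = vec![("blob_a".to_string(), 1u32), ("blob_b".to_string(), 2)];

    let blob_names = blob_name_set(&blob_field_name_to_id);

    let field_name: &str = "blob_a"; // stands in for f.name()
    assert!(blob_names.contains(field_name));
    assert!(!blob_names.contains("other"));
}
```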

Comment on lines +83 to +95
let data_file = frag
    .data_file_for_field(blob_field_id)
    .ok_or_else(|| DataFusionError::Execution("blob data file not found".to_string()))?;

let path_str = dataset.data_dir().child(data_file.path.as_str()).to_string();
let local_path = PathBuf::from("/".to_owned() + &path_str);

let mut file = File::open(&local_path)
    .map_err(|e| DataFusionError::External(Box::new(e)))?;
file.seek(SeekFrom::Start(position))
    .map_err(|e| DataFusionError::External(Box::new(e)))?;
let mut buffer = vec![0; size];
file.read_exact(&mut buffer)


P1: Blob resolver bypasses object store

When materializing blob columns, the code constructs a local filesystem path and reads with std::fs::File. This ignores the dataset’s configured object store and only works for local POSIX paths; datasets stored on S3, GCS, or any non-local store will fail during compaction because the files cannot be opened. It also performs blocking I/O inside the async stream. The blob bytes should be fetched through the dataset’s object_store (similar to BlobFile elsewhere) rather than direct File access.

Useful? React with 👍 / 👎.
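The direction suggested above can be sketched as routing range reads through a small store abstraction rather than hard-coding `std::fs::File`: the dataset's object store supplies the implementation for S3/GCS, and only the demo below uses a local stdlib-only implementation. The `RangeReader` trait and `LocalStore` type are hypothetical, not Lance's actual API:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom, Write};

/// Hypothetical stand-in for an object-store interface: read `len` bytes
/// at `offset`. A real implementation would delegate to the dataset's
/// object_store so non-local paths work and I/O need not block the stream.
trait RangeReader {
    fn read_range(&self, path: &str, offset: u64, len: usize) -> std::io::Result<Vec<u8>>;
}

/// Local-filesystem implementation, for illustration only.
struct LocalStore;

impl RangeReader for LocalStore {
    fn read_range(&self, path: &str, offset: u64, len: usize) -> std::io::Result<Vec<u8>> {
        let mut file = File::open(path)?;
        file.seek(SeekFrom::Start(offset))?;
        let mut buf = vec![0u8; len];
        file.read_exact(&mut buf)?;
        Ok(buf)
    }
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("blob_demo.bin");
    File::create(&path)?.write_all(b"0123456789")?;

    let store = LocalStore;
    let bytes = store.read_range(path.to_str().unwrap(), 3, 4)?;
    assert_eq!(bytes, b"3456");
    Ok(())
}
```

The point of the indirection is that the blob resolver depends only on the trait, so swapping the backing store does not touch the compaction logic.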

Collaborator

Xuanwo commented Nov 10, 2025

Should be supported by #5189

@Xuanwo Xuanwo closed this Nov 14, 2025


3 participants