
Feature/compact blobs #5199

Closed
zszheng wants to merge 3 commits into lance-format:main from zszheng:feature/compact_blobs

Conversation


@zszheng zszheng commented Nov 9, 2025

Issue

#5115

Improvement Points

The current implementation does not yet reuse fragment processing across multiple blob columns; reusing it reduces I/O operations and improves compaction performance.
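The reuse described above can be sketched as grouping the (fragment, blob column) pairs by fragment, so each fragment's data files are opened once instead of once per blob column. The names below (`FragmentId`, `group_by_fragment`) are hypothetical stand-ins, not Lance's actual types:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for Lance's fragment identifier.
type FragmentId = u32;

/// Group (fragment, blob column) pairs by fragment so each fragment
/// is visited once, amortizing the per-fragment I/O across all blob columns.
fn group_by_fragment(pairs: &[(FragmentId, String)]) -> HashMap<FragmentId, Vec<String>> {
    let mut groups: HashMap<FragmentId, Vec<String>> = HashMap::new();
    for (frag, col) in pairs {
        groups.entry(*frag).or_default().push(col.clone());
    }
    groups
}

fn main() {
    let pairs = vec![
        (0, "blob_a".to_string()),
        (0, "blob_b".to_string()),
        (1, "blob_a".to_string()),
    ];
    let groups = group_by_fragment(&pairs);
    // Fragment 0 is processed once, covering both of its blob columns.
    assert_eq!(groups[&0].len(), 2);
    assert_eq!(groups[&1], vec!["blob_a".to_string()]);
}
```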

Verification

  1. Write a batch of data to generate multiple *.lance files
  2. Trigger compact_files to merge the data files
  3. Read back the data and check that the contents match the expected results
  4. Run cleanup, verify that the *.lance files were merged, and query the blob data

@github-actions github-actions Bot added the python label Nov 9, 2025
Contributor

github-actions Bot commented Nov 9, 2025

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +139 to +153
let input_schema = inner.schema();
let blob_names: HashSet<&String> = blob_field_name_to_id.iter().map(|(n, _)| n).collect();

for (name, _) in &blob_field_name_to_id {
    if input_schema.column_with_name(name).is_none() {
        panic!("Input schema missing blob field: {}", name);
    }
}

let fields: Vec<Field> = input_schema
    .fields()
    .iter()
    .map(|f| {
        if blob_names.contains(f.name()) {
            Field::new(f.name(), DataType::LargeBinary, f.is_nullable())


P0: Fix blob name lookup compilation error

The new ResolvedBlobStream::new creates a HashSet<&String> and later calls blob_names.contains(f.name()) where f.name() yields &str. HashSet::contains requires the same borrowed type as its keys, but &String does not implement Borrow<str>, so the call does not compile. As written this module will fail to build. Store owned Strings in the set (e.g. HashSet<String>) or convert the lookup to compare against a String.

Useful? React with 👍 / 👎.
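One minimal way to make the lookup compile, per the review above, is to key the set on `&str` rather than `&String`, since `str: Borrow<str>` holds trivially. A sketch with hypothetical data (an owned `HashSet<String>` works equally well):

```rust
use std::collections::HashSet;

/// Build a set of blob column names keyed on `&str`, so that lookups
/// with a `&str` (like `f.name()` in the diff) compile.
fn blob_name_set(pairs: &[(String, u32)]) -> HashSet<&str> {
    pairs.iter().map(|(n, _)| n.as_str()).collect()
}

fn main() {
    // Hypothetical stand-in for blob_field_name_to_id.
    let blob_field_name_to_id = vec![("blob_a".to_string(), 1u32), ("blob_b".to_string(), 2)];

    let blob_names = blob_name_set(&blob_field_name_to_id);

    let field_name: &str = "blob_a"; // stands in for f.name()
    assert!(blob_names.contains(field_name));
    assert!(!blob_names.contains("other"));
}
```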

Comment on lines +83 to +95
let data_file = frag
    .data_file_for_field(blob_field_id)
    .ok_or_else(|| DataFusionError::Execution("blob data file not found".to_string()))?;

let path_str = dataset.data_dir().child(data_file.path.as_str()).to_string();
let local_path = PathBuf::from("/".to_owned() + &path_str);

let mut file = File::open(&local_path)
    .map_err(|e| DataFusionError::External(Box::new(e)))?;
file.seek(SeekFrom::Start(position))
    .map_err(|e| DataFusionError::External(Box::new(e)))?;
let mut buffer = vec![0; size];
file.read_exact(&mut buffer)


P1: Blob resolver bypasses object store

When materializing blob columns, the code constructs a local filesystem path and reads with std::fs::File. This ignores the dataset’s configured object store and only works for local POSIX paths; datasets stored on S3, GCS, or any non-local store will fail during compaction because the files cannot be opened. It also performs blocking I/O inside the async stream. The blob bytes should be fetched through the dataset’s object_store (similar to BlobFile elsewhere) rather than direct File access.

Useful? React with 👍 / 👎.
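The direction suggested above can be sketched as routing range reads through a small store abstraction rather than hard-coding `std::fs::File`: the dataset's object store supplies the implementation for S3/GCS, and only the demo below uses a local stdlib-only implementation. The `RangeReader` trait and `LocalStore` type are hypothetical, not Lance's actual API:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom, Write};

/// Hypothetical stand-in for an object-store interface: read `len` bytes
/// at `offset`. A real implementation would delegate to the dataset's
/// object_store so non-local paths work and I/O need not block the stream.
trait RangeReader {
    fn read_range(&self, path: &str, offset: u64, len: usize) -> std::io::Result<Vec<u8>>;
}

/// Local-filesystem implementation, for illustration only.
struct LocalStore;

impl RangeReader for LocalStore {
    fn read_range(&self, path: &str, offset: u64, len: usize) -> std::io::Result<Vec<u8>> {
        let mut file = File::open(path)?;
        file.seek(SeekFrom::Start(offset))?;
        let mut buf = vec![0u8; len];
        file.read_exact(&mut buf)?;
        Ok(buf)
    }
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("blob_demo.bin");
    File::create(&path)?.write_all(b"0123456789")?;

    let store = LocalStore;
    let bytes = store.read_range(path.to_str().unwrap(), 3, 4)?;
    assert_eq!(bytes, b"3456");
    Ok(())
}
```

The point of the indirection is that the blob resolver depends only on the trait, so swapping the backing store does not touch the compaction logic.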

Collaborator

Xuanwo commented Nov 10, 2025

Should be supported by #5189

@Xuanwo Xuanwo closed this Nov 14, 2025


3 participants