Conversation

ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, inspect the "PR Title Check" action.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```rust
let input_schema = inner.schema();
let blob_names: HashSet<&String> = blob_field_name_to_id.iter().map(|(n, _)| n).collect();

for (name, _) in &blob_field_name_to_id {
    if input_schema.column_with_name(name).is_none() {
        panic!("Input schema missing blob field: {}", name);
    }
}

let fields: Vec<Field> = input_schema
    .fields()
    .iter()
    .map(|f| {
        if blob_names.contains(f.name()) {
            Field::new(f.name(), DataType::LargeBinary, f.is_nullable())
```
Fix blob name lookup compilation error
The new ResolvedBlobStream::new creates a HashSet<&String> and later calls blob_names.contains(f.name()) where f.name() yields &str. HashSet::contains requires the same borrowed type as its keys, but &String does not implement Borrow<str>, so the call does not compile. As written this module will fail to build. Store owned Strings in the set (e.g. HashSet<String>) or convert the lookup to compare against a String.
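A minimal sketch of the reviewer's suggested fix: own the `String`s in the set so that `contains` can take the `&str` that `Field::name()` yields (via `String: Borrow<str>`). The `(String, i32)` pair type here is a hypothetical stand-in for the real `blob_field_name_to_id` mapping.

```rust
use std::collections::HashSet;

/// Build an owned name set so lookups can accept a &str.
/// HashSet<&String>::contains(&str) does not compile, because
/// &String does not implement Borrow<str>; HashSet<String> does.
fn blob_name_set(blob_field_name_to_id: &[(String, i32)]) -> HashSet<String> {
    blob_field_name_to_id.iter().map(|(n, _)| n.clone()).collect()
}

fn main() {
    // Hypothetical mapping: one blob column named "payload" with field id 42.
    let mapping = vec![("payload".to_string(), 42)];
    let blob_names = blob_name_set(&mapping);

    // `contains` now accepts exactly what Field::name() would yield.
    let field_name: &str = "payload";
    assert!(blob_names.contains(field_name));
    assert!(!blob_names.contains("other"));
}
```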
```rust
let data_file = frag
    .data_file_for_field(blob_field_id)
    .ok_or_else(|| DataFusionError::Execution("blob data file not found".to_string()))?;

let path_str = dataset.data_dir().child(data_file.path.as_str()).to_string();
let local_path = PathBuf::from("/".to_owned() + &path_str);

let mut file = File::open(&local_path)
    .map_err(|e| DataFusionError::External(Box::new(e)))?;
file.seek(SeekFrom::Start(position))
    .map_err(|e| DataFusionError::External(Box::new(e)))?;
let mut buffer = vec![0; size];
file.read_exact(&mut buffer)
```
Blob resolver bypasses object store
When materializing blob columns, the code constructs a local filesystem path and reads with std::fs::File. This ignores the dataset’s configured object store and only works for local POSIX paths; datasets stored on S3, GCS, or any non-local store will fail during compaction because the files cannot be opened. It also performs blocking I/O inside the async stream. The blob bytes should be fetched through the dataset’s object_store (similar to BlobFile elsewhere) rather than direct File access.
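The shape of the suggested change can be sketched with a ranged-read abstraction: resolve blob bytes through whatever backend the dataset is configured with, instead of opening a local `File`. The `RangedStore` trait and in-memory backend below are purely illustrative; the real code would call Lance's `ObjectStore` (which serves local, S3, and GCS paths uniformly) and would do so asynchronously inside the stream.

```rust
use std::collections::HashMap;

/// Illustrative stand-in for the dataset's object store: any backend
/// that can serve a byte range for a path.
trait RangedStore {
    fn get_range(&self, path: &str, position: u64, size: usize) -> Option<Vec<u8>>;
}

/// In-memory backend, used here only so the example is self-contained.
struct MemStore(HashMap<String, Vec<u8>>);

impl RangedStore for MemStore {
    fn get_range(&self, path: &str, position: u64, size: usize) -> Option<Vec<u8>> {
        let data = self.0.get(path)?;
        let start = position as usize;
        data.get(start..start + size).map(|s| s.to_vec())
    }
}

/// Resolve one blob value by (path, position, size) through the store
/// abstraction rather than via std::fs::File, so non-local paths work too.
fn resolve_blob(store: &dyn RangedStore, path: &str, position: u64, size: usize) -> Option<Vec<u8>> {
    store.get_range(path, position, size)
}

fn main() {
    let mut files = HashMap::new();
    files.insert("data/blob.bin".to_string(), b"hello world".to_vec());
    let store = MemStore(files);

    let bytes = resolve_blob(&store, "data/blob.bin", 6, 5).unwrap();
    assert_eq!(bytes, b"world");
}
```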
|
Should be supported by #5189
Issue
#5115
Improvement Points
The current implementation does not yet reuse fragment processing across multiple blob columns, which could reduce I/O operations and improve compaction performance.
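One possible shape for that improvement (a sketch, not the PR's implementation): group the pending blob reads by fragment, so each fragment's data files are visited once for all blob columns rather than once per column. The `(fragment_id, field_id)` request pairs are hypothetical.

```rust
use std::collections::BTreeMap;

/// Group requested blob reads by fragment so one pass over a fragment
/// can materialize every blob column it contains.
fn group_by_fragment(requests: &[(u64, i32)]) -> BTreeMap<u64, Vec<i32>> {
    let mut grouped: BTreeMap<u64, Vec<i32>> = BTreeMap::new();
    for &(fragment_id, field_id) in requests {
        grouped.entry(fragment_id).or_default().push(field_id);
    }
    grouped
}

fn main() {
    // Two blob columns (field ids 3 and 7) over the same two fragments.
    let requests = [(0, 3), (0, 7), (1, 3), (1, 7)];
    let grouped = group_by_fragment(&requests);

    // Each fragment now needs a single pass covering both blob columns.
    assert_eq!(grouped[&0], vec![3, 7]);
    assert_eq!(grouped[&1], vec![3, 7]);
}
```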
Verification