Change FileScanConfig.table_partition_cols from (String, DataType) to Fields
#7890
@alamb and @crepererum
  pub limit: Option<usize>,
  /// The partitioning columns
- pub table_partition_cols: Vec<(String, DataType)>,
+ pub table_partition_cols: Vec<Field>,
Here is the key change
You probably want to use FieldRef not Field
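To illustrate the `FieldRef` suggestion: arrow defines `FieldRef` as `Arc<Field>`, so cloning one bumps a reference count instead of copying the field. The sketch below uses a stand-in `Field` struct rather than the real arrow type, and `share_partition_cols` is a hypothetical helper that is not part of this PR.

```rust
use std::sync::Arc;

// Stand-in for arrow's `Field`; only its shape matters for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    nullable: bool,
}

// Mirrors arrow's definition: `FieldRef` is an alias for `Arc<Field>`.
type FieldRef = Arc<Field>;

/// Hypothetical helper: hand out the partition fields without deep-copying them.
fn share_partition_cols(cols: &[FieldRef]) -> Vec<FieldRef> {
    // `Arc::clone` only increments the reference count; the `Field` is shared.
    cols.iter().map(Arc::clone).collect()
}
```

With `Vec<Field>`, each `to_owned()` copies the name and any metadata; with `Vec<FieldRef>`, the same allocation is shared.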
  let partition_idx = idx - self.file_schema.fields().len();
- let (name, dtype) = &self.table_partition_cols[partition_idx];
- table_fields.push(Field::new(name, dtype.to_owned(), false));
+ table_fields.push(self.table_partition_cols[partition_idx].to_owned());
And this is where we convert table_partition_cols to Field
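The index arithmetic in that hunk can be sketched in isolation: projected indices below the number of file fields select a file column, and the rest select a partition column, which is now pushed as-is. This is a simplified sketch with a stand-in `Field` struct; `project_fields` is a hypothetical free function, not the method in `FileScanConfig`.

```rust
// Stand-in for arrow's `Field`, reduced to two attributes for illustration.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    nullable: bool,
}

/// Hypothetical helper: resolve projected indices against the file fields
/// followed by the partition fields. Panics if an index is out of range.
fn project_fields(
    file_fields: &[Field],
    partition_cols: &[Field],
    projection: &[usize],
) -> Vec<Field> {
    projection
        .iter()
        .map(|&idx| {
            if idx < file_fields.len() {
                file_fields[idx].clone()
            } else {
                // Indices past the file schema address the partition columns.
                let partition_idx = idx - file_fields.len();
                // The stored field is reused, so `nullable` is preserved.
                partition_cols[partition_idx].clone()
            }
        })
        .collect()
}
```

Because the partition column is already a field, nothing is rebuilt with a hard-coded `false` for nullability.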
alamb
left a comment
Thank you @NGA-TRAN -- I think the idea of passing a real Field as the partition column makes a lot of sense and that this PR does it very nicely 👍
I had a few code improvement suggestions, but nothing I think is required to merge this.
Thanks again
datafusion/core/src/datasource/physical_plan/file_scan_config.rs
  )
  }

+ fn config_for_proj_with_field_tab_part(
I find this name confusing given the three-letter abbreviations, and I don't think this style is common elsewhere in the DataFusion codebase.
How about something like
- fn config_for_proj_with_field_tab_part(
+ fn config_for_projection_with_partition_fields(
Or maybe instead you could change config_for_projection to take table_partition_cols: Vec<Field>, and make a function like
/// Convert all `(name, data type)` pairs into non-nullable `Field`s
fn partition_cols(table_partition_cols: Vec<(&str, DataType)>) -> Vec<Field> {
    table_partition_cols
        .into_iter()
        .map(|(name, dtype)| Field::new(name, dtype, false))
        .collect::<Vec<_>>()
}
And then convert the call sites of config_for_projection to be config_for_projection(.., partition_cols(..))
I implemented your second suggestion, @alamb. Thanks
FileScanConfig.table_partition_cols from (String, DataType) to Fields
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
I have addressed all the comments. Thanks @alamb
alamb
left a comment
Thanks @NGA-TRAN
Thanks @NGA-TRAN
Which issue does this PR close?
Closes #7875
Rationale for this change
Currently, `FileScanConfig.table_partition_cols` has data type `Vec<(String, DataType)>`, which stores only each column's name and data type. A column can carry more information, such as `nullable` and extra metadata. Thus, when we convert table_partition_cols to Fields here, all other information of a field is either empty or set to a default.
We want table_partition_cols to be a vector of Fields in the first place, so that when we need to store a Field we won't lose any information.
FYI: IOx needs this requirement.
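The information loss described above can be sketched with stand-in types (these are not the arrow `Field` and `DataType`, and the helper names are hypothetical): once a field is reduced to a `(String, DataType)` pair, rebuilding it has to fall back to defaults for everything else.

```rust
use std::collections::HashMap;

// Stand-ins for arrow's `DataType` and `Field`, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Int32,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
    metadata: HashMap<String, String>,
}

/// Old representation: only the name and data type survive.
fn to_pair(field: &Field) -> (String, DataType) {
    (field.name.clone(), field.data_type.clone())
}

/// Rebuilding from the pair must invent the missing attributes.
fn from_pair(pair: &(String, DataType)) -> Field {
    Field {
        name: pair.0.clone(),
        data_type: pair.1.clone(),
        nullable: false,          // original nullability is unknown here
        metadata: HashMap::new(), // original metadata is gone
    }
}
```

Round-tripping a nullable field with metadata through `to_pair` and `from_pair` yields a field that no longer equals the original, which is the case that storing `Field` directly avoids.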
What changes are included in this PR?
Replace the data type of `FileScanConfig.table_partition_cols`, changing it from `Vec<(String, DataType)>` to `Vec<Field>`.
Are these changes tested?
Yes
Are there any user-facing changes?
The API to create `FileScanConfig` now needs a vector of Fields for `table_partition_cols`. In most places it is an empty vector, which means it is not used.