-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Fix inconsistent schema projection in ListingTable even when schema is specified #16305
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix inconsistent schema projection in ListingTable even when schema is specified #16305
Conversation
…le files in ListingTableConfig
47fcdae to
a40a448
Compare
| let opt_enabled = ListingOptions::new(Arc::new(ParquetFormat::default())) | ||
| .with_collect_stat(true); | ||
| let schema_enabled = opt_enabled.infer_schema(&state, &table_path).await?; | ||
| let config_enabled = ListingTableConfig::new(table_path) | ||
| .with_listing_options(opt_enabled) | ||
| .with_schema(schema_enabled); | ||
| let table_enabled = ListingTable::try_new(config_enabled)?; | ||
|
|
||
| let exec_enabled = table_enabled.scan(&state, None, &[], None).await?; | ||
| assert_eq!( | ||
| exec_enabled.partition_statistics(None)?.num_rows, | ||
| Precision::Exact(8) | ||
| ); | ||
| // TODO correct byte size: https://github.com/apache/datafusion/issues/14936 | ||
| assert_eq!( | ||
| exec_enabled.partition_statistics(None)?.total_byte_size, | ||
| Precision::Exact(671) | ||
| ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from previous do_not_load_table_stats_by_default
let opt = ListingOptions::new(Arc::new(ParquetFormat::default()))
.with_collect_stat(true);
let schema = opt.infer_schema(&state, &table_path).await?;
let config = ListingTableConfig::new(table_path)
.with_listing_options(opt)
.with_schema(schema);
let table = ListingTable::try_new(config)?;
let exec = table.scan(&state, None, &[], None).await?;
assert_eq!(
exec.partition_statistics(None)?.num_rows,
Precision::Exact(8)
);
// TODO correct byte size: https://github.com/apache/datafusion/issues/14936
assert_eq!(
exec.partition_statistics(None)?.total_byte_size,
Precision::Exact(671)
);| let opt_default = ListingOptions::new(Arc::new(ParquetFormat::default())); | ||
| let schema_default = opt_default.infer_schema(&state, &table_path).await?; | ||
| let config_default = ListingTableConfig::new(table_path.clone()) | ||
| .with_listing_options(opt_default) | ||
| .with_schema(schema_default); | ||
| let table_default = ListingTable::try_new(config_default)?; | ||
|
|
||
| let exec_default = table_default.scan(&state, None, &[], None).await?; | ||
| assert_eq!( | ||
| exec_default.partition_statistics(None)?.num_rows, | ||
| Precision::Absent | ||
| ); | ||
|
|
||
| // TODO correct byte size: https://github.com/apache/datafusion/issues/14936 | ||
| assert_eq!( | ||
| exec_default.partition_statistics(None)?.total_byte_size, | ||
| Precision::Absent | ||
| ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace this excerpt from previous do_not_load_table_stats_by_default
let opt = ListingOptions::new(Arc::new(ParquetFormat::default()));
let schema = opt.infer_schema(&state, &table_path).await?;
let config = ListingTableConfig::new(table_path.clone())
.with_listing_options(opt)
.with_schema(schema);
let table = ListingTable::try_new(config)?;
let exec = table.scan(&state, None, &[], None).await?;
assert_eq!(exec.partition_statistics(None)?.num_rows, Precision::Absent);
// TODO correct byte size: https://github.com/apache/datafusion/issues/14936
assert_eq!(
exec.partition_statistics(None)?.total_byte_size,
Precision::Absent
);| /// Indicates the source of the schema for a [`ListingTable`] | ||
| /// PartialEq required for assert_eq! in tests | ||
| #[derive(Debug, Clone, PartialEq)] | ||
| pub enum SchemaSource { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like this, the explicit representation definitely will reduce confusion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100% -- this is great
cc @adriangb
xudong963
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Thanks @xudong963 for the review. |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kosiew and @xudong963 -- I agree this is a really nice improvement
| /// Indicates the source of the schema for a [`ListingTable`] | ||
| /// PartialEq required for assert_eq! in tests | ||
| #[derive(Debug, Clone, PartialEq)] | ||
| pub enum SchemaSource { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100% -- this is great
cc @adriangb
| /// See documentation and example on [`ListingTable`] and [`ListingTableConfig`] | ||
| pub fn try_new(config: ListingTableConfig) -> Result<Self> { | ||
| // Extract schema_source before moving other parts of the config | ||
| let schema_source = config.schema_source().clone(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: since SchemaSource is just a bare enum, it might also make sense to #derive(Copy) so we didn't have to clone it explicitly.
This is totally unecessary I just wanted to mention it as a possibility
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sense to #derive(Copy) so we didn't have to clone it explicitly.
💯 and implemented here.
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kosiew and @xudong963 -- I agree this is a really nice improvement
0790f53 to
21ad8fb
Compare
Which issue does this PR close?
ListingTableConfig#16270Rationale for this change
Understanding the origin of a schema (whether it was inferred or explicitly specified) is important for debugging, reproducibility, and behavioral consistency in systems like DataFusion that operate on dynamic data sources. Previously, this information was not available in the
ListingTableor its configuration, making it hard to reason about schema behavior.What changes are included in this PR?
Schema-source metadata
SchemaSourceenum to track the origin of a schema (None,Inferred,Specified).ListingTableConfigandListingTableto carry and expose this metadata.schema_source()) to inspect schema origin in both config and table.Imports cleanup
table.rsfor clarity and consistency.Test refactoring & additions
read_single_file.test_table_stats_behaviors.test_list_files_configurations.test_insert_into_parameterized.SchemaSourcecases:test_schema_source_tracking_comprehensiveinfer_preserves_provided_schemacreate_test_schema,generate_test_files…).Deleted tests → New tests mapping
read_single_file(old)read_single_file(refactored)do_not_load_table_stats_by_defaultload_table_stats_when_no_statstest_table_stats_behaviorstest_assert_list_files_for_scan_groupingtest_assert_list_files_for_multi_pathtest_assert_list_files_for_exact_pathstest_list_files_configurationstest_insert_into_append_new_json_filestest_insert_into_append_new_csv_filestest_insert_into_append_2_new_parquet_files_defaultstest_insert_into_append_1_new_parquet_files_defaultstest_insert_into_parameterizedAre these changes tested?
read_single_file,test_table_stats_behaviorstest_schema_source_tracking_comprehensive,infer_preserves_provided_schematest_list_files_configurationstest_insert_into_parameterizedShared helpers and parameterized loops ensure that every previously tested scenario is still exercised, with improved maintainability and coverage.
Are there any user-facing changes?
Yes:
ListingTablenow exposes aschema_source()method, enabling downstream consumers to programmatically check the origin of the schema.There are no breaking changes to the public API, but the enhancement provides improved introspection capabilities.