Contributor

@kosiew kosiew commented Jun 6, 2025

Which issue does this PR close?

Rationale for this change

Understanding the origin of a schema (whether it was inferred or explicitly specified) is important for debugging, reproducibility, and behavioral consistency in systems like DataFusion that operate on dynamic data sources. Previously, this information was not available in the ListingTable or its configuration, making it hard to reason about schema behavior.

What changes are included in this PR?

  • Schema-source metadata

    • Introduced a new SchemaSource enum to track the origin of a schema (None, Inferred, Specified).
    • Extended ListingTableConfig and ListingTable to carry and expose this metadata.
    • Ensured that schema inference logic respects an explicitly set schema and does not overwrite it.
    • Added public accessors (schema_source()) to inspect schema origin in both config and table.
  • Imports cleanup

    • Reorganized imports in table.rs for clarity and consistency.
  • Test refactoring & additions

    • Refactored single-file scan/statistics tests:
      • Kept and cleaned up read_single_file.
      • Unified Parquet-stats coverage into a single test_table_stats_behaviors.
    • Consolidated file-listing tests into a parameterized test_list_files_configurations.
    • Parameterized insert-into append tests via test_insert_into_parameterized.
    • Added comprehensive unit tests for all SchemaSource cases:
      • test_schema_source_tracking_comprehensive
      • infer_preserves_provided_schema
    • Removed dozens of redundant individual tests in favor of DRY loops and shared helpers (create_test_schema, generate_test_files…).
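
The schema-source tracking described in the list above can be pictured with a small standalone sketch. It does not use the DataFusion crate: `TableConfig`, the `Vec<String>` stand-in for a schema, and the `infer_schema` signature are hypothetical simplifications. Only the `SchemaSource` variant names (None, Inferred, Specified) and the `schema_source()` accessor name come from this PR.

```rust
/// Where a table's schema came from (variant names taken from the PR description).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SchemaSource {
    /// No schema has been set yet.
    None,
    /// The schema was inferred from the data files.
    Inferred,
    /// The schema was supplied explicitly by the caller.
    Specified,
}

/// Hypothetical stand-in for `ListingTableConfig`.
/// A "schema" here is just a list of column names (an assumption for brevity).
#[derive(Debug, Clone)]
pub struct TableConfig {
    schema: Option<Vec<String>>,
    schema_source: SchemaSource,
}

impl TableConfig {
    pub fn new() -> Self {
        Self { schema: None, schema_source: SchemaSource::None }
    }

    /// Setting a schema explicitly marks it as `Specified`.
    pub fn with_schema(mut self, schema: Vec<String>) -> Self {
        self.schema = Some(schema);
        self.schema_source = SchemaSource::Specified;
        self
    }

    /// `discovered` plays the role of whatever inference would read from files.
    /// An already-present schema is kept, so an explicit schema is never overwritten.
    pub fn infer_schema(mut self, discovered: Vec<String>) -> Self {
        if self.schema.is_none() {
            self.schema = Some(discovered);
            self.schema_source = SchemaSource::Inferred;
        }
        self
    }

    pub fn schema(&self) -> Option<&[String]> {
        self.schema.as_deref()
    }

    pub fn schema_source(&self) -> SchemaSource {
        self.schema_source
    }
}

fn main() {
    let config = TableConfig::new()
        .with_schema(vec!["id".to_string(), "name".to_string()])
        .infer_schema(vec!["id".to_string()]);

    // A downstream consumer can branch on the origin of the schema.
    match config.schema_source() {
        SchemaSource::None => println!("no schema set"),
        SchemaSource::Inferred => println!("schema inferred: {:?}", config.schema()),
        SchemaSource::Specified => println!("schema specified: {:?}", config.schema()),
    }
}
```

In this sketch the "respect an explicitly set schema" rule is just the `is_none()` guard in `infer_schema`; the real implementation may enforce it differently.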

Deleted tests → New tests mapping

| Deleted test(s) | Replacement test |
| --- | --- |
| read_single_file (old) | read_single_file (refactored) |
| do_not_load_table_stats_by_default, load_table_stats_when_no_stats | test_table_stats_behaviors |
| test_assert_list_files_for_scan_grouping, test_assert_list_files_for_multi_path, test_assert_list_files_for_exact_paths | test_list_files_configurations |
| test_insert_into_append_new_json_files, test_insert_into_append_new_csv_files, test_insert_into_append_2_new_parquet_files_defaults, test_insert_into_append_1_new_parquet_files_defaults | test_insert_into_parameterized |

Are these changes tested?

  • Yes. This PR includes a full suite of unit tests covering:
    • Single-file + stats: read_single_file, test_table_stats_behaviors
    • Schema-source tracking: test_schema_source_tracking_comprehensive, infer_preserves_provided_schema
    • File-listing grouping: test_list_files_configurations
    • Insert-into append behavior: test_insert_into_parameterized

Shared helpers and parameterized loops ensure that every previously tested scenario is still exercised, with improved maintainability and coverage.
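
As a rough illustration of the "shared helpers plus parameterized loop" style mentioned above, the sketch below drives one assertion from a table of cases. Everything in it is an assumption made for illustration: `split_into_groups` is a hypothetical stand-in (not DataFusion's file-grouping logic), the body of `generate_test_files` is invented (only the helper's name appears in this PR), and the cases are made up.

```rust
/// Hypothetical helper: spreads `files` over at most `target_partitions`
/// groups using ceiling-sized chunks. Not DataFusion's grouping algorithm.
fn split_into_groups(files: &[String], target_partitions: usize) -> Vec<Vec<String>> {
    if files.is_empty() || target_partitions == 0 {
        return Vec::new();
    }
    // Ceiling division so that no more than `target_partitions` groups are produced.
    let chunk_size = (files.len() + target_partitions - 1) / target_partitions;
    files.chunks(chunk_size).map(|chunk| chunk.to_vec()).collect()
}

/// Hypothetical helper in the spirit of `generate_test_files`:
/// builds `count` file names under `prefix`.
fn generate_test_files(prefix: &str, count: usize) -> Vec<String> {
    (0..count).map(|i| format!("{prefix}/file{i}")).collect()
}

/// One row of the parameter table.
struct Case {
    name: &'static str,
    file_count: usize,
    target_partitions: usize,
    expected_groups: usize,
}

/// Runs every case through the same assertion; the case name is included
/// in the failure message so a failing row is easy to identify.
fn run_cases() {
    let cases = [
        Case { name: "no files", file_count: 0, target_partitions: 2, expected_groups: 0 },
        Case { name: "fewer files than partitions", file_count: 2, target_partitions: 12, expected_groups: 2 },
        Case { name: "more files than partitions", file_count: 5, target_partitions: 2, expected_groups: 2 },
        Case { name: "uneven split", file_count: 5, target_partitions: 4, expected_groups: 3 },
    ];

    for case in &cases {
        let files = generate_test_files("bucket/key-prefix", case.file_count);
        let groups = split_into_groups(&files, case.target_partitions);
        assert_eq!(groups.len(), case.expected_groups, "case '{}' failed", case.name);
    }
}

fn main() {
    run_cases();
    println!("all cases passed");
}
```

Adding a scenario then means adding one row to `cases` rather than writing another near-identical test function.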

Are there any user-facing changes?

Yes:

  • ListingTable now exposes a schema_source() method, enabling downstream consumers to programmatically check the origin of the schema.
  • This may help users understand or debug unexpected schema-related behavior when working with listing tables.

There are no breaking changes to the public API, but the enhancement provides improved introspection capabilities.

@github-actions github-actions bot added the core Core DataFusion crate label Jun 6, 2025
@kosiew kosiew marked this pull request as draft June 6, 2025 14:15
@kosiew kosiew marked this pull request as ready for review June 7, 2025 08:14
@kosiew kosiew force-pushed the list-table-config-file-schema-16270 branch from 47fcdae to a40a448 June 7, 2025 16:05
Comment on lines +2492 to +2509
        let opt_enabled = ListingOptions::new(Arc::new(ParquetFormat::default()))
            .with_collect_stat(true);
        let schema_enabled = opt_enabled.infer_schema(&state, &table_path).await?;
        let config_enabled = ListingTableConfig::new(table_path)
            .with_listing_options(opt_enabled)
            .with_schema(schema_enabled);
        let table_enabled = ListingTable::try_new(config_enabled)?;

        let exec_enabled = table_enabled.scan(&state, None, &[], None).await?;
        assert_eq!(
            exec_enabled.partition_statistics(None)?.num_rows,
            Precision::Exact(8)
        );
        // TODO correct byte size: https://github.com/apache/datafusion/issues/14936
        assert_eq!(
            exec_enabled.partition_statistics(None)?.total_byte_size,
            Precision::Exact(671)
        );
Contributor Author

From the previous do_not_load_table_stats_by_default:

        let opt = ListingOptions::new(Arc::new(ParquetFormat::default()))
            .with_collect_stat(true);
        let schema = opt.infer_schema(&state, &table_path).await?;
        let config = ListingTableConfig::new(table_path)
            .with_listing_options(opt)
            .with_schema(schema);
        let table = ListingTable::try_new(config)?;

        let exec = table.scan(&state, None, &[], None).await?;
        assert_eq!(
            exec.partition_statistics(None)?.num_rows,
            Precision::Exact(8)
        );
        // TODO correct byte size: https://github.com/apache/datafusion/issues/14936
        assert_eq!(
            exec.partition_statistics(None)?.total_byte_size,
            Precision::Exact(671)
        );

Comment on lines +2454 to +2471
        let opt_default = ListingOptions::new(Arc::new(ParquetFormat::default()));
        let schema_default = opt_default.infer_schema(&state, &table_path).await?;
        let config_default = ListingTableConfig::new(table_path.clone())
            .with_listing_options(opt_default)
            .with_schema(schema_default);
        let table_default = ListingTable::try_new(config_default)?;

        let exec_default = table_default.scan(&state, None, &[], None).await?;
        assert_eq!(
            exec_default.partition_statistics(None)?.num_rows,
            Precision::Absent
        );

        // TODO correct byte size: https://github.com/apache/datafusion/issues/14936
        assert_eq!(
            exec_default.partition_statistics(None)?.total_byte_size,
            Precision::Absent
        );
Contributor Author

Replaces this excerpt from the previous do_not_load_table_stats_by_default:

        let opt = ListingOptions::new(Arc::new(ParquetFormat::default()));
        let schema = opt.infer_schema(&state, &table_path).await?;
        let config = ListingTableConfig::new(table_path.clone())
            .with_listing_options(opt)
            .with_schema(schema);
        let table = ListingTable::try_new(config)?;

        let exec = table.scan(&state, None, &[], None).await?;
        assert_eq!(exec.partition_statistics(None)?.num_rows, Precision::Absent);
        // TODO correct byte size: https://github.com/apache/datafusion/issues/14936
        assert_eq!(
            exec.partition_statistics(None)?.total_byte_size,
            Precision::Absent
        );

@xudong963 xudong963 self-requested a review June 9, 2025 09:54
/// Indicates the source of the schema for a [`ListingTable`]
/// PartialEq required for assert_eq! in tests
#[derive(Debug, Clone, PartialEq)]
pub enum SchemaSource {
Member

I really like this, the explicit representation definitely will reduce confusion

Contributor

100% -- this is great

cc @adriangb

Member

@xudong963 xudong963 left a comment

LGTM, thank you @kosiew / cc @alamb

@kosiew kosiew changed the title from "Fix inconsistent schema projection in ListingTable when file order varies by tracking schema source" to "Fix inconsistent schema projection in ListingTable even when schema is specified" Jun 9, 2025
Contributor Author

kosiew commented Jun 9, 2025

Thanks @xudong963 for the review.

Contributor

@alamb alamb left a comment

Thanks @kosiew and @xudong963 -- I agree this is a really nice improvement

    /// See documentation and example on [`ListingTable`] and [`ListingTableConfig`]
    pub fn try_new(config: ListingTableConfig) -> Result<Self> {
        // Extract schema_source before moving other parts of the config
        let schema_source = config.schema_source().clone();
Contributor

Minor nit: since SchemaSource is just a bare enum, it might also make sense to #[derive(Copy)] so we don't have to clone it explicitly.

This is totally unnecessary; I just wanted to mention it as a possibility.

Contributor Author

make sense to #derive(Copy) so we didn't have to clone it explicitly.

💯 and implemented here.
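
A minimal standalone sketch of what deriving `Copy` buys in a constructor like `try_new`. The `Config` and `Table` structs are hypothetical stand-ins for `ListingTableConfig` and `ListingTable`, reduced to two fields, and the accessor returning the enum by value is an assumption of this sketch rather than a statement about the merged code.

```rust
/// Fieldless enum, so deriving `Copy` is cheap and always possible.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SchemaSource {
    None,
    Inferred,
    Specified,
}

/// Hypothetical stand-in for the table configuration.
struct Config {
    schema_source: SchemaSource,
    table_paths: Vec<String>,
}

impl Config {
    /// Because `SchemaSource` is `Copy`, the accessor can hand back the value itself.
    fn schema_source(&self) -> SchemaSource {
        self.schema_source
    }
}

/// Hypothetical stand-in for the table built from the configuration.
struct Table {
    schema_source: SchemaSource,
    table_paths: Vec<String>,
}

impl Table {
    fn try_new(config: Config) -> Table {
        // No explicit `.clone()`: the value is copied out, leaving `config` intact.
        let schema_source = config.schema_source();
        // The non-Copy field can still be moved out of `config` afterwards.
        let table_paths = config.table_paths;
        Table { schema_source, table_paths }
    }

    fn schema_source(&self) -> SchemaSource {
        self.schema_source
    }
}

fn main() {
    for source in [SchemaSource::None, SchemaSource::Inferred, SchemaSource::Specified] {
        let config = Config { schema_source: source, table_paths: vec!["data/".to_string()] };
        let table = Table::try_new(config);
        println!("{:?} -> {} path(s)", table.schema_source(), table.table_paths.len());
    }
}
```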

@kosiew kosiew force-pushed the list-table-config-file-schema-16270 branch from 0790f53 to 21ad8fb June 9, 2025 13:43
@xudong963 xudong963 merged commit 2986415 into apache:main Jun 10, 2025
28 checks passed
@kosiew kosiew deleted the list-table-config-file-schema-16270 branch July 16, 2025 03:21
Development

Successfully merging this pull request may close these issues.

Inconsistent schema coercion in ListingTableConfig
