DataFrame::union() does not detect schema mismatches #13287

@ttencate

Describe the bug

Using datafusion version 42.2.0.

Follow up to #13092, which was fixed by #13117 thanks to @Omega359.

However, this fix does not catch mistakes such as reordered columns. For example, if table A has columns a, b and table B has columns b, a, then DataFusion will happily compute the union by position, putting B's values in the wrong columns.

So why not just compare the entire schema? Or at least the column names and types (i.e. ignoring metadata)? The docs explicitly say that the schemas must be equal.
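To make the suggestion concrete, here is a minimal sketch of such a check in plain Rust. It is not DataFusion's implementation: `Field`, `field` and `check_union_compatible` are hypothetical stand-ins, and types are modelled as strings for brevity. It compares names and types position by position and deliberately ignores metadata.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a schema field; not DataFusion's or Arrow's type.
#[derive(Debug, Clone)]
struct Field {
    name: String,
    data_type: String, // e.g. "Utf8"; a real schema would use an enum
    metadata: HashMap<String, String>, // intentionally never compared
}

// Convenience constructor for the sketch (hypothetical helper).
fn field(name: &str, data_type: &str) -> Field {
    Field {
        name: name.to_string(),
        data_type: data_type.to_string(),
        metadata: HashMap::new(),
    }
}

/// Returns `Ok(())` when both schemas have the same column names and types
/// in the same order. Metadata is ignored. Otherwise describes the first
/// mismatch found.
fn check_union_compatible(left: &[Field], right: &[Field]) -> Result<(), String> {
    if left.len() != right.len() {
        return Err(format!(
            "column count differs: {} vs {}",
            left.len(),
            right.len()
        ));
    }
    for (i, (l, r)) in left.iter().zip(right).enumerate() {
        if l.name != r.name || l.data_type != r.data_type {
            return Err(format!(
                "column {i} differs: {}: {} vs {}: {}",
                l.name, l.data_type, r.name, r.data_type
            ));
        }
    }
    Ok(())
}
```

With the schemas from this issue, `check_union_compatible` would reject `[a, b]` against `[b, a]` because the names differ at position 0, while two schemas that differ only in metadata would still be accepted.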

To Reproduce

#[tokio::test]
async fn test_union() {
    use datafusion::prelude::SessionContext;
    use datafusion::assert_batches_sorted_eq;
    use datafusion::common::arrow::array::{ArrayRef, StringArray};
    use datafusion::common::arrow::record_batch::RecordBatch;
    use std::sync::Arc;

    let ctx = SessionContext::new();
    let a = ctx
        .read_batch(
            RecordBatch::try_from_iter([
                ("a", Arc::new(StringArray::from(vec!["a"])) as ArrayRef),
                ("b", Arc::new(StringArray::from(vec!["b"])) as ArrayRef),
            ])
            .unwrap(),
        )
        .unwrap();
    let b = ctx
        .read_batch(
            RecordBatch::try_from_iter([
                ("b", Arc::new(StringArray::from(vec!["b"])) as ArrayRef),
                ("a", Arc::new(StringArray::from(vec!["a"])) as ArrayRef),
            ])
            .unwrap(),
        )
        .unwrap();

    let union = a.union(b).unwrap();
    assert_batches_sorted_eq!(
        [
            "+---+---+",
            "| a | b |",
            "+---+---+",
            "| a | b |",
            "| a | b |",
            "+---+---+",
        ],
        &union.collect().await.unwrap()
    );
}

Expected behavior

Test passes.

Actual behavior

assertion `left == right` failed: 

expected:

[
    "+---+---+",
    "| a | b |",
    "+---+---+",
    "| a | b |",
    "| a | b |",
    "+---+---+",
]
actual:

[
    "+---+---+",
    "| a | b |",
    "+---+---+",
    "| a | b |",
    "| b | a |",
    "+---+---+",
]
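The output above is consistent with a purely positional union. One possible workaround until `union()` validates column order is to reorder the second input's columns to match the first before calling it (in DataFusion, presumably via a projection such as `select_columns`). The sketch below models only the index computation in plain Rust. `projection_by_name` and `reorder_row` are hypothetical helpers, and column names are assumed to be unique.

```rust
/// For each column name in `target`, finds its index in `source`.
/// Returns `None` if the column counts differ or a name is missing.
/// Assumes column names are unique within each schema.
fn projection_by_name(target: &[&str], source: &[&str]) -> Option<Vec<usize>> {
    if target.len() != source.len() {
        return None;
    }
    target
        .iter()
        .map(|name| source.iter().position(|s| s == name))
        .collect()
}

/// Applies a projection to one row, producing values in target column order.
fn reorder_row<T: Clone>(row: &[T], projection: &[usize]) -> Vec<T> {
    projection.iter().map(|&i| row[i].clone()).collect()
}
```

For the tables in this report, the target order `["a", "b"]` against the source order `["b", "a"]` gives the projection `[1, 0]`, so the second input's row `["b", "a"]` is reordered to `["a", "b"]`, matching the expected output.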

Labels: bug