Relax physical schema validation#14519
Conversation
The physical plan can be further optimized. In particular, an expression can be determined to be never null even if that wasn't known at logical planning time. Thus, the final schema check needs to be relaxed, allowing now-non-null data where nullable data was expected. This replaces the schema equality check with an asymmetric "is satisfied by" relation.
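A minimal sketch of the asymmetric relation described above. The `Schema`/`Field` structs and the `String`-based data type below are simplified stand-ins, not the actual arrow/DataFusion types; only the shape of the check mirrors the PR:

```rust
// Simplified stand-ins for arrow's Schema/Field (hypothetical, for illustration).
struct Field {
    name: String,
    data_type: String, // stand-in for arrow's DataType
    nullable: bool,
}

struct Schema {
    fields: Vec<Field>,
}

/// `candidate` satisfies `original` if fields agree on name and type, and the
/// candidate is never "more nullable" than the original promised.
fn schema_satisfied_by(original: &Schema, candidate: &Schema) -> bool {
    original.fields.len() == candidate.fields.len()
        && original.fields.iter().zip(&candidate.fields).all(|(o, c)| {
            o.name == c.name
                && o.data_type == c.data_type
                // a nullable original may be satisfied by a non-null candidate,
                // but not the other way around
                && (o.nullable || !c.nullable)
        })
}

fn main() {
    let nullable = |n: &str| Field { name: n.into(), data_type: "Int64".into(), nullable: true };
    let non_null = |n: &str| Field { name: n.into(), data_type: "Int64".into(), nullable: false };

    let logical = Schema { fields: vec![nullable("a")] };
    let physical = Schema { fields: vec![non_null("a")] };

    // relaxed direction: the physical plan proved "a" is never null
    assert!(schema_satisfied_by(&logical, &physical));
    // the relation is asymmetric: data may not become nullable where non-null was promised
    assert!(!schema_satisfied_by(&physical, &logical));
    println!("ok");
}
```

Note the asymmetry: unlike `Schema` equality, swapping the arguments can change the answer.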
// TODO (DataType::Union(_, _), DataType::Union(_, _)) => {}
// TODO (DataType::Dictionary(_, _), DataType::Dictionary(_, _)) => {}
// TODO (DataType::Map(_, _), DataType::Map(_, _)) => {}
// TODO (DataType::RunEndEncoded(_, _), DataType::RunEndEncoded(_, _)) => {}
Is there a reason to not add these as part of this PR that I'm missing?
Laziness, and avoiding PR scope creep. I wanted to get the structure clear and agreed upon first.
For example, it's not totally obvious we should be recursing into types at all. I think we should, but that's the decision being made here.
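To make the recursion question concrete, here is a hedged sketch (with hypothetical, simplified types, not the arrow `DataType`) of why nested types matter: a list's element field carries its own nullability, so the relation arguably has to recurse into it:

```rust
// Hypothetical, minimal stand-ins for arrow types, for illustration only.
enum DataType {
    Int64,
    List(Box<Field>), // element field has its own nullability
}

struct Field {
    nullable: bool,
    data_type: DataType,
}

/// `candidate` satisfies `original` if it is not "more nullable", recursing
/// into nested types so element-level nullability is relaxed the same way.
fn field_satisfied_by(original: &Field, candidate: &Field) -> bool {
    (original.nullable || !candidate.nullable)
        && match (&original.data_type, &candidate.data_type) {
            (DataType::Int64, DataType::Int64) => true,
            // recurse into the element field of the nested type
            (DataType::List(o), DataType::List(c)) => field_satisfied_by(o, c),
            _ => false,
        }
}

fn main() {
    let list_of = |elem_nullable| Field {
        nullable: false,
        data_type: DataType::List(Box::new(Field {
            nullable: elem_nullable,
            data_type: DataType::Int64,
        })),
    };
    // physical plan proved the list elements are never null: satisfied
    assert!(field_satisfied_by(&list_of(true), &list_of(false)));
    // elements may not become nullable where non-null was promised
    assert!(!field_satisfied_by(&list_of(false), &list_of(true)));
}
```

Without the recursive arm, two lists differing only in element nullability would have to be either rejected outright or accepted blindly; the TODO variants (Union, Dictionary, Map, RunEndEncoded) would get analogous arms.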
  differences.push(format!("field data type at index {} [{}]: (physical) {} vs (logical) {}", i, physical_field.name(), physical_field.data_type(), logical_field.data_type()));
  }
- if physical_field.is_nullable() != logical_field.is_nullable() {
+ if physical_field.is_nullable() && !logical_field.is_nullable() {
Like it! I still don't get why we check nullability in schema equivalence at all 🤔 Logical and physical schemas can be derived differently, and nullability is sometimes derived in different ways as well.
Nullability checks were a source of dozens of schema-mismatch problems, especially for UNION.
Likely only a few cases like Union are exceptions; in most cases nullability doesn't change.
> Logical and physical schemas can be derived differently, and nullability is sometimes derived in different ways as well.
Agreed, but the earlier-delivered schema acts as a contract (promise) for the later-delivered schema.
If we told the world that an expr won't contain null values, we can't change our mind at physical planning time; that would violate the constraint (promise/contract).
If we told the world that an expr may contain null values, we didn't promise that it will contain null values, so we may happen to produce no null values (and even be aware of that).
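The promise logic above reduces to a single implication: the later schema may be nullable only where the earlier one already promised nullable data. A minimal sketch (hypothetical helper name) covering all four combinations:

```rust
/// Returns true when `candidate_nullable` does not break the promise made by
/// `original_nullable`, i.e. candidate_nullable implies original_nullable.
fn nullability_satisfied(original_nullable: bool, candidate_nullable: bool) -> bool {
    !candidate_nullable || original_nullable
}

fn main() {
    assert!(nullability_satisfied(true, true));   // promised nulls, may produce nulls
    assert!(nullability_satisfied(true, false));  // promised nulls, proved none: the relaxed case
    assert!(nullability_satisfied(false, false)); // promised no nulls, produces none
    assert!(!nullability_satisfied(false, true)); // promised no nulls, may now produce them: violation
}
```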
/// schemas except that original schema can have nullable fields where candidate
/// is constrained to not provide null data.
pub(crate) fn schema_satisfied_by(original: &Schema, candidate: &Schema) -> bool {
    original.metadata() == candidate.metadata()
Wondering, do we really need to compare metadata? If it works for now we can keep it, but since metadata is not strongly typed (in fact just a HashMap<String, String>), it might become an issue if someone decides to store something there in the logical or physical schema.
I agree.
Note that the bottom line, aka the original behavior, is the schema Eq check, which includes a metadata equality check.
In this PR I wanted to relax the nullability checks only.