Relax physical schema validation#14519
Conversation
The physical plan can be further optimized. In particular, an expression can be determined to be never null even if that wasn't known at logical planning time. Thus, the final schema check needs to be relaxed, allowing now-non-null data where nullable data was expected. This replaces the schema equality check with an asymmetric "is satisfied by" relation.
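A minimal sketch of the asymmetric relation described above. The `Schema`/`Field` structs and the `String`-based data type below are simplified stand-ins, not the actual arrow/DataFusion types; only the shape of the check mirrors the PR:

```rust
// Simplified stand-ins for arrow's Schema/Field (hypothetical, for illustration).
struct Field {
    name: String,
    data_type: String, // stand-in for arrow's DataType
    nullable: bool,
}

struct Schema {
    fields: Vec<Field>,
}

/// `candidate` satisfies `original` if fields agree on name and type, and the
/// candidate is never "more nullable" than the original promised.
fn schema_satisfied_by(original: &Schema, candidate: &Schema) -> bool {
    original.fields.len() == candidate.fields.len()
        && original.fields.iter().zip(&candidate.fields).all(|(o, c)| {
            o.name == c.name
                && o.data_type == c.data_type
                // a nullable original may be satisfied by a non-null candidate,
                // but not the other way around
                && (o.nullable || !c.nullable)
        })
}

fn main() {
    let nullable = |n: &str| Field { name: n.into(), data_type: "Int64".into(), nullable: true };
    let non_null = |n: &str| Field { name: n.into(), data_type: "Int64".into(), nullable: false };

    let logical = Schema { fields: vec![nullable("a")] };
    let physical = Schema { fields: vec![non_null("a")] };

    // relaxed direction: the physical plan proved "a" is never null
    assert!(schema_satisfied_by(&logical, &physical));
    // the relation is asymmetric: data may not become nullable where non-null was promised
    assert!(!schema_satisfied_by(&physical, &logical));
    println!("ok");
}
```

Note the asymmetry: unlike `Schema` equality, swapping the arguments can change the answer.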
// TODO (DataType::Union(_, _), DataType::Union(_, _)) => {}
// TODO (DataType::Dictionary(_, _), DataType::Dictionary(_, _)) => {}
// TODO (DataType::Map(_, _), DataType::Map(_, _)) => {}
// TODO (DataType::RunEndEncoded(_, _), DataType::RunEndEncoded(_, _)) => {}
Is there a reason to not add these as part of this PR that I'm missing?
Laziness, and avoiding PR scope creep. I wanted to get the structure clear and agreed upon first.
For example, it's not totally obvious we should be recursing into types at all. I think we should, but that's the decision being made here.
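To make the recursion question concrete, here is a hedged sketch (with hypothetical, simplified types, not the arrow `DataType`) of why nested types matter: a list's element field carries its own nullability, so the relation arguably has to recurse into it:

```rust
// Hypothetical, minimal stand-ins for arrow types, for illustration only.
enum DataType {
    Int64,
    List(Box<Field>), // element field has its own nullability
}

struct Field {
    nullable: bool,
    data_type: DataType,
}

/// `candidate` satisfies `original` if it is not "more nullable", recursing
/// into nested types so element-level nullability is relaxed the same way.
fn field_satisfied_by(original: &Field, candidate: &Field) -> bool {
    (original.nullable || !candidate.nullable)
        && match (&original.data_type, &candidate.data_type) {
            (DataType::Int64, DataType::Int64) => true,
            // recurse into the element field of the nested type
            (DataType::List(o), DataType::List(c)) => field_satisfied_by(o, c),
            _ => false,
        }
}

fn main() {
    let list_of = |elem_nullable| Field {
        nullable: false,
        data_type: DataType::List(Box::new(Field {
            nullable: elem_nullable,
            data_type: DataType::Int64,
        })),
    };
    // physical plan proved the list elements are never null: satisfied
    assert!(field_satisfied_by(&list_of(true), &list_of(false)));
    // elements may not become nullable where non-null was promised
    assert!(!field_satisfied_by(&list_of(false), &list_of(true)));
}
```

Without the recursive arm, two lists differing only in element nullability would have to be either rejected outright or accepted blindly; the TODO variants (Union, Dictionary, Map, RunEndEncoded) would get analogous arms.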
  differences.push(format!("field data type at index {} [{}]: (physical) {} vs (logical) {}", i, physical_field.name(), physical_field.data_type(), logical_field.data_type()));
  }
- if physical_field.is_nullable() != logical_field.is_nullable() {
+ if physical_field.is_nullable() && !logical_field.is_nullable() {
Like it! I still don't get why we check nullability in schema equivalence at all 🤔 Logical and physical schemas can be derived differently, and nullability is sometimes derived in different ways as well.
Nullability checks were a source of dozens of schema-mismatch problems, especially for UNION.
Likely only a few cases like Union are exceptions; in most cases nullability doesn't change.
> Logical and physical schemas can be derived differently, and nullability is sometimes derived in different ways as well.
Agreed, but the earlier-delivered schema acts as a contract (promise) for the later-delivered schema.
If we told the world that an expr won't contain null values, we can't change our mind at physical planning time; that would violate the constraint (promise/contract).
If we told the world that an expr may contain null values, we didn't promise that it will contain null values, so we may happen to produce no null values (and even be aware of that).
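The promise logic above reduces to a single implication: the later schema may be nullable only where the earlier one already promised nullable data. A minimal sketch (hypothetical helper name) covering all four combinations:

```rust
/// Returns true when `candidate_nullable` does not break the promise made by
/// `original_nullable`, i.e. candidate_nullable implies original_nullable.
fn nullability_satisfied(original_nullable: bool, candidate_nullable: bool) -> bool {
    !candidate_nullable || original_nullable
}

fn main() {
    assert!(nullability_satisfied(true, true));   // promised nulls, may produce nulls
    assert!(nullability_satisfied(true, false));  // promised nulls, proved none: the relaxed case
    assert!(nullability_satisfied(false, false)); // promised no nulls, produces none
    assert!(!nullability_satisfied(false, true)); // promised no nulls, may now produce them: violation
}
```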
/// schemas except that original schema can have nullable fields where candidate
/// is constrained to not provide null data.
pub(crate) fn schema_satisfied_by(original: &Schema, candidate: &Schema) -> bool {
    original.metadata() == candidate.metadata()
Wondering, do we really need to compare metadata? If it works for now we can keep it, but since metadata is not strongly typed (in fact just a HashMap<String, String>), it might become an issue if someone decides to store something there in the logical or physical schema.
I agree.
Note that the bottom line, aka the original behavior, is the schema Eq check, which includes a metadata equality check.
In this PR I wanted to relax the nullability checks only.