Implement comparisons on nested data types such that distinct/except would work #11117

Merged
alamb merged 1 commit into apache:main from buoyant-data:issue-10749-only
Jun 27, 2024

@rtyler rtyler commented Jun 25, 2024

Which issue does this PR close?

Closes #10749

Rationale for this change

This relies on newer functionality in arrow 52 and allows DataFrame.except() to work properly on schemas with structs and lists. I'm not sure this is the appropriate way to handle the change, but I included the regression case from the issue as a test to demonstrate the fix.
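
To make the intended behavior concrete, here is a minimal Rust sketch of the row-level semantics, using only the standard library. The `Row` type and the `except` function are hypothetical stand-ins, not DataFusion's API or implementation. The sketch only models keeping each left-hand row that has no equal row on the right, where a null nested value counts as equal to another null.

```rust
// Simplified model only: `Row` and `except` are hypothetical stand-ins,
// not DataFusion types or functions.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    id: i32,
    // A nested "struct" column; `None` models a NULL struct.
    point: Option<(i32, i32)>,
    // A nested "list" column.
    tags: Vec<String>,
}

/// Keep every row of `left` that has no equal row in `right`.
/// The derived `PartialEq` treats `None == None` as true, which models
/// the null-equals-null comparison used for set operations.
fn except(left: &[Row], right: &[Row]) -> Vec<Row> {
    left.iter()
        .filter(|l| !right.iter().any(|r| r == *l))
        .cloned()
        .collect()
}

fn main() {
    let a = Row { id: 1, point: Some((0, 0)), tags: vec!["x".to_string()] };
    let b = Row { id: 2, point: None, tags: vec![] };
    let c = Row { id: 3, point: Some((1, 2)), tags: vec!["y".to_string()] };

    let left = vec![a.clone(), b.clone(), c.clone()];
    let right = vec![a, b];

    // Only the row missing from `right` survives, including the
    // comparison of the NULL struct in row `b`.
    let result = except(&left, &right);
    assert_eq!(result, vec![c]);
}
```

The point of the sketch is that equality has to be defined for the nested columns themselves; without that, a row such as `b` cannot be matched at all.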

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

…would work

This relies on newer functionality in arrow 52 and allows
DataFrame.except() to properly work on schemas with structs and lists

Closes apache#10749
@github-actions github-actions Bot added the core Core DataFusion crate label Jun 25, 2024

@alamb alamb left a comment

Thank you @rtyler -- I think this is a nice improvement. I left some suggestions for improving the comments and naming, but I think they could go in a follow-on PR.

It might also make sense to check whether other kernels need the same handling (eq_dyn, for example).

if left.data_type().is_nested() && null_equals_null {
    let cmp = make_comparator(left, right, SortOptions::default())?;
    let len = left.len().min(right.len());
    let values = (0..len).map(|i| cmp(i, i).is_eq()).collect();

This is likely quite slow, as it does dynamic dispatch per row.

However, slow is better than not working at first.

Could you please update the name of the function to reflect that it isn't just for nulls anymore? Perhaps we could rename it to eq_dyn or something more generic.
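
For context, the per-row cost can be pictured with a std-only model. The names `DynComparator`, `make_row_comparator`, and `equal_rows` are hypothetical and only mimic the shape of the snippet under review; this is not arrow's code. The sketch assumes that a null compares equal to another null, as in the `null_equals_null` branch above.

```rust
use std::cmp::Ordering;

// Hypothetical stand-in for a type-erased comparator: given two row
// indices, it reports how the left row orders against the right row.
type DynComparator<'a> = Box<dyn Fn(usize, usize) -> Ordering + 'a>;

// A nested value modelled as a nullable list of nullable integers.
type Nested = Option<Vec<Option<i32>>>;

/// Build a comparator over two columns. `Option` orders `None` equal to
/// `None`, which gives null-equals-null behavior in this model.
fn make_row_comparator<'a>(left: &'a [Nested], right: &'a [Nested]) -> DynComparator<'a> {
    Box::new(move |i, j| left[i].cmp(&right[j]))
}

/// Produce one boolean per row. Each row costs one call through the
/// boxed closure, which is the dynamic dispatch the review points at.
fn equal_rows(left: &[Nested], right: &[Nested]) -> Vec<bool> {
    let cmp = make_row_comparator(left, right);
    let len = left.len().min(right.len());
    (0..len).map(|i| cmp(i, i).is_eq()).collect()
}

fn main() {
    let left = vec![Some(vec![Some(1), None]), None, Some(vec![Some(3)])];
    let right = vec![Some(vec![Some(1), None]), None, Some(vec![Some(4)])];
    assert_eq!(equal_rows(&left, &right), vec![true, true, false]);
}
```

A vectorized kernel would instead process a whole column per call, which is why the per-row approach is described as slow but acceptable as a first step.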

I think other than the potential rename the PR is ready to go -- however, I also think we could do the rename as a follow-on PR.

Note @jayzhan211 added similar code to handle nested comparisons in eq_datum in #11091 -- I wonder if we could consolidate those implementations somehow.

@jayzhan211 jayzhan211 Jun 27, 2024

We could do the comparison with the datum function; #11091 moves it to physical-common.
It would be a nice alternative to equal_rows_arr.
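
One way to read the consolidation idea is a single helper that takes the operator as a parameter and derives every comparison from one three-way ordering. The std-only sketch below illustrates that shape under stated assumptions: `CmpOp`, `apply`, and `compare_nested` are hypothetical names, nulls are left out to keep it short, and none of this is taken from #11091.

```rust
use std::cmp::Ordering;

// Hypothetical operator set for this sketch only.
#[derive(Clone, Copy, Debug)]
enum CmpOp {
    Eq,
    NotEq,
    Lt,
    LtEq,
    Gt,
    GtEq,
}

/// Translate one three-way comparison result into a boolean for `op`.
fn apply(op: CmpOp, ord: Ordering) -> bool {
    match op {
        CmpOp::Eq => ord.is_eq(),
        CmpOp::NotEq => ord.is_ne(),
        CmpOp::Lt => ord.is_lt(),
        CmpOp::LtEq => ord.is_le(),
        CmpOp::Gt => ord.is_gt(),
        CmpOp::GtEq => ord.is_ge(),
    }
}

/// Compare two columns of list values row by row (non-null values only).
/// Lists are ordered lexicographically, as `Vec` does by default.
fn compare_nested(op: CmpOp, left: &[Vec<i32>], right: &[Vec<i32>]) -> Vec<bool> {
    left.iter()
        .zip(right)
        .map(|(l, r)| apply(op, l.cmp(r)))
        .collect()
}

fn main() {
    let left = vec![vec![1, 2], vec![3], vec![5, 0]];
    let right = vec![vec![1, 2], vec![4], vec![5]];
    assert_eq!(compare_nested(CmpOp::Eq, &left, &right), vec![true, false, false]);
    assert_eq!(compare_nested(CmpOp::NotEq, &left, &right), vec![false, true, true]);
    assert_eq!(compare_nested(CmpOp::Lt, &left, &right), vec![false, true, false]);
    assert_eq!(compare_nested(CmpOp::LtEq, &left, &right), vec![true, true, false]);
    assert_eq!(compare_nested(CmpOp::Gt, &left, &right), vec![false, false, true]);
    assert_eq!(compare_nested(CmpOp::GtEq, &left, &right), vec![true, false, true]);
}
```

With this shape, an equality-only helper becomes the `CmpOp::Eq` case of the general function rather than a separate code path.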

in #11149

Comment thread datafusion/physical-plan/src/joins/hash_join.rs

rtyler commented Jun 27, 2024

I am quite indifferent to the solution here as long as #10749 is resolved 😄

Happy to have this closed out in favor of a better implementation!

alamb commented Jun 27, 2024

> I am quite indifferent to the solution here as long as #10749 is resolved 😄
>
> Happy to have this closed out in favor of a better implementation!

This PR is great and I think a step forward (the code no longer errors!)

I'll make a follow-on PR to try to simplify the implementation.

@alamb alamb merged commit d2ff218 into apache:main Jun 27, 2024

alamb commented Jun 27, 2024

Thanks again @rtyler and @jayzhan211

alamb commented Jun 27, 2024

Filed #11149 with a proposed simpler implementation

@rtyler rtyler deleted the issue-10749-only branch June 27, 2024 23:38
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…would work (apache#11117)

This relies on newer functionality in arrow 52 and allows
DataFrame.except() to properly work on schemas with structs and lists

Closes apache#10749

Development

Successfully merging this pull request may close these issues.

DataFrame.except() does not work with structs in schema

3 participants