fix: fall back to Spark for hash join and sort-merge join on non-default collated string keys [Spark 4] #4095

Open
0lai0 wants to merge 2 commits into apache:main from 0lai0:fix-4051-collated-join

0lai0 commented Apr 26, 2026

Which issue does this PR close?

Closes #4051

Rationale for this change

Comet's hash join and sort-merge join compare join keys with byte-level equality, so they can produce wrong results when the keys use a non-default string collation. This PR makes those joins fall back to Spark instead of executing natively.
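To illustrate the failure mode, here is a minimal, Spark-free sketch. It assumes lowercase folding as a stand-in for a case-insensitive collation such as UTF8_LCASE (real collations are more involved), and every name in it (`CollatedJoinSketch`, `bytesEqual`, `lcaseEqual`, `innerJoin`) is hypothetical rather than taken from Comet or Spark.

```scala
// Simplified model only: lowercase folding approximates a case-insensitive
// collation. None of these names exist in Comet or Spark.
object CollatedJoinSketch {
  // Byte-level equality, i.e. what a collation-unaware join effectively uses.
  def bytesEqual(a: String, b: String): Boolean =
    java.util.Arrays.equals(a.getBytes("UTF-8"), b.getBytes("UTF-8"))

  // Case-insensitive equality (assumption: approximated by lowercase folding).
  def lcaseEqual(a: String, b: String): Boolean =
    a.toLowerCase(java.util.Locale.ROOT) == b.toLowerCase(java.util.Locale.ROOT)

  // Naive nested-loop inner join on the keys under the given equality.
  def innerJoin(left: Seq[String], right: Seq[String])(
      eq: (String, String) => Boolean): Seq[(String, String)] =
    for (l <- left; r <- right if eq(l, r)) yield (l, r)
}
```

With left keys `a`, `B` and right keys `A`, `b`, the byte-level join returns no rows, while the folded comparison matches both pairs. That gap is the kind of wrong result the fallback avoids.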

What changes are included in this PR?

Add collation guards to:

  • CometHashJoin.doConvert
  • CometSortMergeJoinExec.supportedSortMergeJoinEqualType
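The shape of such a guard can be sketched as follows. This is not the code from the PR: the types below are hypothetical stand-ins for Spark's data types, `fallbackReason` is an invented helper, and whether the real guards recurse into nested types is an assumption, since the description does not say.

```scala
// Hypothetical model of a collation guard; not the real Spark or Comet classes.
object CollationGuardSketch {
  sealed trait KeyType
  final case class StringKey(collation: String) extends KeyType
  case object IntKey extends KeyType
  final case class StructKey(fields: Seq[KeyType]) extends KeyType
  final case class ArrayKey(element: KeyType) extends KeyType

  // UTF8_BINARY is Spark's default collation; byte equality is correct for it.
  val DefaultCollation = "UTF8_BINARY"

  // True if the type is, or contains, a string with a non-default collation.
  // Recursing into structs and arrays is an assumption of this sketch.
  def hasNonDefaultCollation(t: KeyType): Boolean = t match {
    case StringKey(c)  => c != DefaultCollation
    case StructKey(fs) => fs.exists(hasNonDefaultCollation)
    case ArrayKey(e)   => hasNonDefaultCollation(e)
    case IntKey        => false
  }

  // Some(reason) means "fall back to Spark"; None means native conversion may proceed.
  def fallbackReason(joinKeys: Seq[KeyType]): Option[String] =
    if (joinKeys.exists(hasNonDefaultCollation))
      Some("join key uses a non-default string collation")
    else None
}
```

Returning a reason rather than a bare boolean makes it easy to surface why a join was not converted, which helps when reading plans that unexpectedly stay on Spark.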

How are these changes tested?

Added regression tests in CometCollationSuite covering collation handling for Broadcast Hash Join, Shuffled Hash Join, and Sort-Merge Join. To run them, along with the existing join suite:

./mvnw -pl spark -Pspark-4.0 -DwildcardSuites=org.apache.spark.sql.CometCollationSuite test
./mvnw -pl spark -Pspark-4.0 -DwildcardSuites=org.apache.comet.exec.CometJoinSuite test

andygrove left a comment

Thanks @0lai0. I cloned the PR locally and confirmed that the new tests fail without the fix.

Development

Successfully merging this pull request may close these issues.

Broadcast hash join and sort-merge join can produce incorrect results on non-default collated keys [Spark 4]