perf: Implement spark_translate function to improve translate performance #2993
shuch3ng wants to merge 3 commits into apache:main.
…ranslate` expression
21ebf69 to 0363685
Thank you for the PR @shuch3ng. Any reason why we can't improve upstream DataFusion's `translate` implementation instead?
Hi @coderfender. That's a good question. Upon checking DataFusion's `translate` implementation, I found that its behavior differs from Spark's `translate` in a couple of cases, which is why I implemented a Comet-specific `spark_translate`. I'm not very familiar with the DataFusion project so I could be wrong, but pushing these changes upstream could affect other use cases in DataFusion.
See my comment on #2976; this PR can be closed if it's confirmed to be a non-issue. W.r.t. the two differences mentioned above, they are out of scope.
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##              main    #2993      +/-   ##
============================================
+ Coverage     56.12%   59.58%    +3.46%
- Complexity      976     1377      +401
============================================
  Files           119      167       +48
  Lines         11743    15493     +3750
  Branches       2251     2569      +318
============================================
+ Hits           6591     9232     +2641
- Misses         4012     4962      +950
- Partials       1140     1299      +159
Apologies for not reviewing this PR sooner. I agree with @shuch3ng that the Comet requirements are not compatible with the DataFusion implementation. It makes sense to implement it in this repo, or to eventually upstream it.
andygrove left a comment
Thanks for the contribution @shuch3ng.
This looks good overall, but my main concern is around argument pattern coverage. The implementation only handles the case where from and to are ScalarValue::Utf8(Some(...)), but CometScalarFunction("translate") sends all
argument patterns to the native side without filtering, and there's no runtime fallback to Spark. The existing SQL tests in string_translate.sql include queries like SELECT translate(s, from_str, to_str) FROM test_translate where all three arguments are columns — these would hit the error branch since the match only accepts [Array, Scalar, Scalar]. CI is still pending, but I'd expect those tests to fail.
A few ways this could be addressed:
- fall back to DataFusion's built-in translate for unmatched patterns (registering it from the registry for the general case),
- add a custom serde handler (like CometSubstring does) that only routes to the custom implementation when from/to are foldable, or
- handle all argument patterns in the Rust code.
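For illustration, here is a minimal sketch of how the `[Array, Scalar, Scalar]` pattern could be handled on the native side, with every other pattern surfacing an explicit error that a caller could use to trigger a fallback. This is not the PR's actual code; the function name, error messages, and structure are assumptions.

```rust
use std::collections::HashMap;
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, StringArray};
use datafusion::error::{DataFusionError, Result};
use datafusion::logical_expr::ColumnarValue;
use datafusion::scalar::ScalarValue;

/// Translate one string: mapped chars are replaced, chars in `from` with no
/// counterpart in `to` are removed, all other chars pass through unchanged.
fn translate_str(s: &str, mapping: &HashMap<char, Option<char>>) -> String {
    s.chars()
        .filter_map(|c| match mapping.get(&c) {
            Some(Some(replacement)) => Some(*replacement),
            Some(None) => None,
            None => Some(c),
        })
        .collect()
}

/// Hypothetical kernel: only `[Array, Scalar, Scalar]` with non-null Utf8
/// scalars is handled; every other argument pattern returns an error.
fn spark_translate(args: &[ColumnarValue]) -> Result<ColumnarValue> {
    match args {
        [ColumnarValue::Array(strings), ColumnarValue::Scalar(ScalarValue::Utf8(Some(from))), ColumnarValue::Scalar(ScalarValue::Utf8(Some(to)))] => {
            // Build the char-to-char mapping once per batch, not once per row.
            let mut to_chars = to.chars();
            let mut mapping: HashMap<char, Option<char>> = HashMap::new();
            for f in from.chars() {
                let t = to_chars.next(); // consumed positionally, even for duplicates
                mapping.entry(f).or_insert(t);
            }

            let input = strings
                .as_any()
                .downcast_ref::<StringArray>()
                .ok_or_else(|| DataFusionError::Internal("expected a Utf8 array".to_string()))?;

            let output: StringArray = input
                .iter()
                .map(|v| v.map(|s| translate_str(s, &mapping)))
                .collect();
            Ok(ColumnarValue::Array(Arc::new(output) as ArrayRef))
        }
        _ => Err(DataFusionError::NotImplemented(
            "spark_translate: unsupported argument pattern".to_string(),
        )),
    }
}
```

In a sketch like this, the mapping is built once per batch rather than once per row, which is where a per-batch implementation would gain over a naive per-row rebuild; the error branch is the place where a DataFusion or Spark fallback would have to hook in.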
Which issue does this PR close?
Closes #2976.
Rationale for this change
What changes are included in this PR?
This PR implements a custom `spark_translate` function optimised for Spark's `translate`.
How are these changes tested?
Unit tests are included
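As a hedged illustration of the Spark `translate` semantics such tests need to cover (positional mapping, with characters in `from` that have no counterpart in `to` being deleted), here is a small self-contained test sketch; it is not the PR's actual test code.

```rust
use std::collections::HashMap;

/// Apply Spark translate semantics: positional from->to mapping,
/// with unmatched `from` characters deleted.
fn translate(s: &str, from: &str, to: &str) -> String {
    let mut to_chars = to.chars();
    let mut mapping: HashMap<char, Option<char>> = HashMap::new();
    for f in from.chars() {
        let t = to_chars.next(); // consumed positionally, even for duplicate keys
        mapping.entry(f).or_insert(t);
    }
    s.chars()
        .filter_map(|c| match mapping.get(&c) {
            Some(Some(r)) => Some(*r), // mapped character
            Some(None) => None,        // deleted character
            None => Some(c),           // character passes through unchanged
        })
        .collect()
}

#[test]
fn spark_translate_semantics() {
    // Example from the Spark SQL documentation for translate.
    assert_eq!(translate("AaBbCc", "abc", "123"), "A1B2C3");
    // `l` has no counterpart in `to`, so it is removed.
    assert_eq!(translate("hello", "el", "i"), "hio");
    // Characters not present in `from` are left untouched.
    assert_eq!(translate("hello", "xyz", "abc"), "hello");
}
```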
Benchmark: