perf: Optimize NULL handling in StringViewArrayBuilder#21538
perf: Optimize NULL handling in StringViewArrayBuilder#21538martin-g merged 5 commits intoapache:mainfrom
StringViewArrayBuilder#21538Conversation
|
FYI @Omega359 |
Omega359
left a comment
There was a problem hiding this comment.
I looked this over and it seems reasonable, thanks!
|
@martin-g @timsaucer Thanks for the reviews! I noticed that the existing code panicked for large input strings, so I adjusted that to return an error (via |
|
run benchmark concat |
|
🤖 Criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing neilc/perf-string-view-builder-nulls (4e33f61) to 4389f14 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
@martin-g Is this okay to merge? |
|
Thanks @martin-g @timsaucer ! |
| } | ||
|
|
||
| pub fn append_offset(&mut self) { | ||
| pub fn append_offset(&mut self) -> Result<()> { |
There was a problem hiding this comment.
I think this is an API change we should call out (I'll add a api-change label)
There was a problem hiding this comment.
Is there a reason why StringViewArrayBuilder is pub, rather than pub(crate)? It is a fairly special-purpose API.
There was a problem hiding this comment.
TLDR is I don't know -- it got added for some very special purpose reason I can't remember (I think ideally we would use the upstream one in arrow-rs, but for some reason we needed a special copy. I can't remember why)
There was a problem hiding this comment.
I wrote out a quick plan for what I'd like to do here, see #21684 ; comments very welcome
## Which issue does this PR close? - Closes apache#21537. ## Rationale for this change `StringViewArrayBuilder` is implemented on top of Arrow's `StringViewBuilder`; the latter tracks NULLs incrementally. However, the `StringViewArrayBuilder` requires callers to pass a NULL buffer to `finish()` anyway, so the NULL bitmap that has been computed by `StringViewBuilder` is discarded. It would be more efficient to stop using `StringViewBuilder` so that we don't do this redundant work; in theory there might be room for inconsistency between the two NULL bitmaps as well. Right now, `StringViewArrayBuilder` is only used by the `concat` and `concat_ws` UDFs, but I'd like to generalize the API and use it more broadly in place of `StringViewBuilder` (apache#21539). For the time being, here are the results of this PR on the `concat` benchmarks (Arm64): ``` - 1024 rows: 29.6 µs → 28.0 µs, -5.3% - 4096 rows: 134.3 µs → 125.6 µs, -6.5% - 8192 rows: 289.7 µs → 273.5 µs, -5.6% ``` ## What changes are included in this PR? * Stop using `StringViewBuilder` and build the views ourselves * Improve some comments * Return an error instead of panicking on large input strings ## Are these changes tested? Yes. ## Are there any user-facing changes? No. --------- Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
…e#21695) ## Which issue does this PR close? - Part of apache#21684 but does not close it. ## Rationale for this change Rename the three builders in `datafusion/functions/src/strings.rs` to make their special-purpose nature explicit: - `StringArrayBuilder` -> `ConcatStringBuilder` - `LargeStringArrayBuilder` -> `ConcatLargeStringBuilder` - `StringViewArrayBuilder` -> `ConcatStringViewBuilder` These builders are used only by `concat` and `concat_ws`, and their APIs are specific enough to the needs of `concat`-like callers that it seems unlikely that other call-sites will emerge in the future. This also frees the previous names of these builders to be used by a more broadly useful string builder API. Also make these types `pub(crate)`, instead of `pub`; there is no good reason to make them part of the public API. This is a breaking API change. ## What changes are included in this PR? * Rename types * Make types `pub(crate)`, not `pub` * Add section to v54 migration guide ## Are these changes tested? Yes. ## Are there any user-facing changes? Yes, breaking API change -- although these types should probably not have been part of the public API to begin with, and we already made a breaking API change to them as part of other work in v54 (apache#21538)
Which issue does this PR close?
StringViewArrayBuilder#21537.Rationale for this change
StringViewArrayBuilderis implemented on top of Arrow'sStringViewBuilder; the latter tracks NULLs incrementally. However, theStringViewArrayBuilderrequires callers to pass a NULL buffer tofinish()anyway, so the NULL bitmap that has been computed byStringViewBuilderis discarded. It would be more efficient to stop usingStringViewBuilderso that we don't do this redundant work; in theory there might be room for inconsistency between the two NULL bitmaps as well.Right now,
StringViewArrayBuilderis only used by theconcatandconcat_wsUDFs, but I'd like to generalize the API and use it more broadly in place ofStringViewBuilder(#21539). For the time being, here are the results of this PR on theconcatbenchmarks (Arm64):What changes are included in this PR?
StringViewBuilderand build the views ourselvesAre these changes tested?
Yes.
Are there any user-facing changes?
No.