feat(spark): implement StringView for SparkConcat #19984

Merged
Jefffrey merged 4 commits into apache:main from aryan-212:utf8view-sparkconcat
Jan 27, 2026
@aryan-212
Contributor

@aryan-212 aryan-212 commented Jan 25, 2026

Which issue does this PR close?

  • This PR is part of the Utf8View support epic (#10918). It adds Utf8View support in the Spark-compat layer.

Rationale for this change

Our internal project supports only Utf8View (because of design constraints), but the current implementation of SparkConcat supports only Utf8. SparkConcat should accept Utf8View and mixed string types, in line with the main DataFusion concat. This PR adds that support and follows the same patterns as DataFusion’s concat (https://github.com/apache/datafusion/blob/main/datafusion/functions/src/string/concat.rs).

This prevents errors like:

The type of Utf8 AND Utf8View of like physical should be same.
This issue was likely caused by a bug in DataFusion's code. Please help us to resolve this by filing a bug report in our issue tracker: https://github.com/apache/datafusion/issues

from a query like:

select i_item_sk,
       item_info
from
  (select i_item_sk,
          CONCAT('Item: ', i_item_desc) as item_info
   from item) sub
where item_info LIKE 'Item: Electronic%'
order by 1;

What changes are included in this PR?

  • Extend the type signature to accept Utf8View in addition to Utf8 and LargeUtf8 via TypeSignature::Variadic(vec![Utf8View, Utf8, LargeUtf8]) matching DataFusion’s concat.

  • In return_field_from_args, compute the result type with precedence Utf8View > LargeUtf8 > Utf8.

  • In spark_concat, handle Utf8View and LargeUtf8 in the scalar paths (zero-argument and all-NULL).
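
The precedence rule above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: `StringType` is a hypothetical stand-in for the three Arrow string types, `concat_result_type` is a made-up helper name, and the fallback to Utf8 for an empty argument list is an assumption of the sketch.

```rust
/// Hypothetical stand-in for the three Arrow string types named in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StringType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Pick the result type with precedence Utf8View > LargeUtf8 > Utf8.
/// An empty argument list falls back to Utf8 (an assumption of this sketch).
fn concat_result_type(arg_types: &[StringType]) -> StringType {
    if arg_types.contains(&StringType::Utf8View) {
        // Any Utf8View argument wins over the other two types.
        StringType::Utf8View
    } else if arg_types.contains(&StringType::LargeUtf8) {
        // Otherwise LargeUtf8 wins over plain Utf8.
        StringType::LargeUtf8
    } else {
        StringType::Utf8
    }
}
```

With this rule, a call that mixes a Utf8 literal with a Utf8View column resolves to Utf8View, which is presumably why no extra CAST is needed on the view-typed column.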

Are these changes tested?

Yes.

  • Unit tests: cargo test --package datafusion-spark function::string::concat::tests, including test_concat_utf8view.
  • Sqllogictest: spark/string/concat.slt includes a “Utf8View: no extra CAST in plan” case that uses EXPLAIN and a temporary table to ensure no extra CASTs when using arrow_cast(..., 'Utf8View') with table columns.

Are there any user-facing changes?

  • API: SparkConcat’s signature is extended to include Utf8View in the variadic list. No breaking changes.

used gpt to rephrase some of these points

@github-actions github-actions Bot added sqllogictest SQL Logic Tests (.slt) spark labels Jan 25, 2026
@aryan-212 aryan-212 force-pushed the utf8view-sparkconcat branch from fc92fa6 to e052fa3 Compare January 25, 2026 08:27
@aryan-212 aryan-212 changed the title feat: implement StringView for SparkConcat feat(spark): implement StringView for SparkConcat Jan 25, 2026
Member

@Weijun-H Weijun-H left a comment

LGTM, thanks @aryan-212 👍

It would be good to add some mixed-type tests:

# Test mixed types: Utf8View + Utf8
query T
SELECT concat(arrow_cast('hello', 'Utf8View'), ' world');
----
hello world

# Test all three types mixed together
query T
SELECT concat('a', arrow_cast('b', 'LargeUtf8'), arrow_cast('c', 'Utf8View'));
----
abc
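
For reference, the value-level behaviour these cases exercise can be sketched in a few lines of Rust, independent of which physical string type carries the data. This is only an illustration: `spark_concat_values` is a hypothetical helper, not a function from the PR. It assumes Spark's usual rule that any NULL argument makes the result NULL, and it assumes that a zero-argument call yields an empty string.

```rust
/// Value-level sketch of Spark-style concat over optional strings.
/// `None` models SQL NULL; the physical string type is ignored here.
fn spark_concat_values(args: &[Option<&str>]) -> Option<String> {
    // With zero arguments the loop never runs and the result is an
    // empty string (an assumption of this sketch).
    let mut out = String::new();
    for arg in args {
        // Any NULL argument short-circuits and makes the whole result NULL.
        out.push_str((*arg)?);
    }
    Some(out)
}
```

Under those assumptions, the two cases above map to `[Some("hello"), Some(" world")]` and `[Some("a"), Some("b"), Some("c")]`.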

@aryan-212 aryan-212 force-pushed the utf8view-sparkconcat branch 3 times, most recently from d791634 to 9329a43 Compare January 25, 2026 10:28
@aryan-212
Contributor Author

Added them @Weijun-H, thanks for reviewing 🙇

Comment thread datafusion/spark/src/function/string/concat.rs
Comment thread datafusion/spark/src/function/string/concat.rs
@aryan-212 aryan-212 force-pushed the utf8view-sparkconcat branch 2 times, most recently from ad92725 to 7bb3ba0 Compare January 25, 2026 15:10
@aryan-212 aryan-212 requested a review from Jefffrey January 25, 2026 15:10
@aryan-212 aryan-212 force-pushed the utf8view-sparkconcat branch from 7bb3ba0 to 02362fe Compare January 25, 2026 15:21
Comment thread datafusion/spark/src/function/string/concat.rs Outdated
Comment thread datafusion/sqllogictest/test_files/spark/string/concat.slt Outdated
Comment thread datafusion/sqllogictest/test_files/spark/string/concat.slt Outdated
@aryan-212 aryan-212 force-pushed the utf8view-sparkconcat branch 2 times, most recently from 1c997dc to 28156b8 Compare January 25, 2026 17:16
@aryan-212
Contributor Author

@Jefffrey, I made the required test changes. Please have a look 🙇

Comment thread datafusion/spark/src/function/string/concat.rs Outdated
@aryan-212 aryan-212 force-pushed the utf8view-sparkconcat branch 3 times, most recently from 3715990 to 577b604 Compare January 25, 2026 17:40
@aryan-212 aryan-212 force-pushed the utf8view-sparkconcat branch from 577b604 to 1358178 Compare January 25, 2026 17:40
@Jefffrey Jefffrey added this pull request to the merge queue Jan 27, 2026
Merged via the queue into apache:main with commit f5709e7 Jan 27, 2026
28 checks passed
@Jefffrey
Contributor

Thanks @aryan-212 & @Weijun-H

@aryan-212 aryan-212 deleted the utf8view-sparkconcat branch January 27, 2026 07:07
de-bgunter pushed a commit to de-bgunter/datafusion that referenced this pull request Mar 24, 2026