-
Notifications
You must be signed in to change notification settings - Fork 1.9k
preserve Field metadata in first_value/last_value #19335
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| } | ||
|
|
||
| fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> { | ||
| Ok(arg_types[0].clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we should have return_type return an internal error instead of leaving it implemented if we use return_field now
| } | ||
|
|
||
| fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> { | ||
| Ok(arg_types[0].clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto
| /// UDF to extract metadata from a field for testing purposes | ||
| /// Usage: get_metadata(expr, 'key') -> returns the metadata value or NULL | ||
| #[derive(Debug, PartialEq, Eq, Hash)] | ||
| struct GetMetadataUdf { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this is useful enough to introduce as a function to datafusion itself, instead of being only in SLT? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to do so if you think it's worth it - I went with the conservative approach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to do it in this PR, but worth filing a ticket for as followup
| 3 1 | ||
| NULL 1 | ||
|
|
||
| # Regression test: first_value should preserve metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed for the existing regression tests in this file, they don't actually check metadata 🤔
datafusion/datafusion/sqllogictest/test_files/metadata.slt
Lines 63 to 67 in 58377bf
| # Regression test: prevent field metadata loss per https://github.com/apache/datafusion/issues/12687 | |
| query I | |
| select count(distinct name) from table_with_metadata; | |
| ---- | |
| 2 |
With this new UDF we can look into updating those tests to be more similar to the ones introduced here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will file a ticket before we merge this to do so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the tests I think test that there is metadata on the input tables (rather than the output tables)
I do really like the idea of adding a UDF, simlarly to arrow_typeof that can show the metadata
``sql
select arrow_typeof('foo');
+---------------------------+
| arrow_typeof(Utf8("foo")) |
+---------------------------+
| Utf8 |
+---------------------------+
1 row(s) fetched.
Elapsed 0.024 seconds.
Possibilities: add a new argument
```sql
> select arrow_typeof('foo', true);
Possibility: add a new function
> select arrow_metadata('foo');There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would help debug various metadata issues more easily
I can file a ticket if you think this is reasonable
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| 3 1 | ||
| NULL 1 | ||
|
|
||
| # Regression test: first_value should preserve metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the tests I think test that there is metadata on the input tables (rather than the output tables)
I do really like the idea of adding a UDF, simlarly to arrow_typeof that can show the metadata
``sql
select arrow_typeof('foo');
+---------------------------+
| arrow_typeof(Utf8("foo")) |
+---------------------------+
| Utf8 |
+---------------------------+
1 row(s) fetched.
Elapsed 0.024 seconds.
Possibilities: add a new argument
```sql
> select arrow_typeof('foo', true);
Possibility: add a new function
> select arrow_metadata('foo');| 3 1 | ||
| NULL 1 | ||
|
|
||
| # Regression test: first_value should preserve metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would help debug various metadata issues more easily
I can file a ticket if you think this is reasonable
| } | ||
|
|
||
| fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> { | ||
| // Get the metadata key from the second argument (must be a string literal) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would also be nice if we supported a single column version that returned the metadata as a struct array too
| ---- | ||
| 1 the id field NULL the name field | ||
| 3 the id field baz the name field | ||
| NULL the id field bar the name field |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would happen if a non-existing column is passed as a first argument of the get_metadata() udf ? Or a scalar value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Return an empty map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or NULL.
I mean it would be good to have some negative test cases too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would happen if a non-existing column is passed as a first argument of the get_metadata() udf
I think it should probably error like any other query that tries to access a undefined column
|
I opened #19356 to track generalizing the function added in this PR. |
|
@erratic-pattern is looking at something similar for us upstream in InfluxDB |
Which issue does this PR close?
Closes #19336
Rationale for this change
The
first_valueandlast_valueaggregate functions were not preserving Field metadata from their input arguments. This caused metadata to be lost when using these functions, which affects downstream consumers that rely on metadata (e.g., for DISTINCT ON queries which usefirst_valueinternally).What changes are included in this PR?
return_field()forFirstValueto preserve input field metadatareturn_field()forLastValueto preserve input field metadataget_metadataUDF for testing metadata preservation in sqllogictestfirst_value,last_value,DISTINCT ON,DISTINCT, and grouped columnsAre these changes tested?
Yes, new sqllogictest tests are added in
metadata.sltthat verify metadata is preserved through various aggregate operations.Are there any user-facing changes?
Yes, Field metadata is now correctly preserved when using
first_value()andlast_value()aggregate functions. This is a bug fix that improves metadata propagation.🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com