adriangb (Contributor) commented Dec 15, 2025

Which issue does this PR close?

Closes #19336

Rationale for this change

The first_value and last_value aggregate functions were not preserving Field metadata from their input arguments. This caused metadata to be lost when using these functions, which affects downstream consumers that rely on metadata (e.g., for DISTINCT ON queries which use first_value internally).

What changes are included in this PR?

  • Implement return_field() for FirstValue to preserve input field metadata
  • Implement return_field() for LastValue to preserve input field metadata
  • Add get_metadata UDF for testing metadata preservation in sqllogictest
  • Add regression tests for first_value, last_value, DISTINCT ON, DISTINCT, and grouped columns
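
To illustrate why a field-aware hook matters here, the following is a minimal sketch using simplified stand-in types, not the real Arrow or DataFusion API. `DataType`, `Field`, `output_field_from_type`, and `output_field_from_field` are all hypothetical names introduced only for this example. It contrasts building the output from a bare data type (metadata cannot survive) with building it from the input field (metadata is copied).

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for Arrow's `DataType` and `Field`; not the real API.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
    metadata: HashMap<String, String>,
}

// Type-only path: the output is rebuilt from a bare `DataType`,
// so there is no metadata available to carry over.
fn output_field_from_type(name: &str, arg_types: &[DataType]) -> Field {
    Field {
        name: name.to_string(),
        data_type: arg_types[0].clone(),
        nullable: true,
        metadata: HashMap::new(),
    }
}

// Field-aware path: the output copies the first argument's type and metadata.
fn output_field_from_field(name: &str, arg_fields: &[Field]) -> Field {
    Field {
        name: name.to_string(),
        data_type: arg_fields[0].data_type.clone(),
        nullable: true,
        metadata: arg_fields[0].metadata.clone(),
    }
}
```

Under these assumptions, only the second function can return a field whose metadata matches the input's, which is the behavior the PR describes for `first_value` and `last_value`.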

Are these changes tested?

Yes, new sqllogictest tests are added in metadata.slt that verify metadata is preserved through various aggregate operations.

Are there any user-facing changes?

Yes, Field metadata is now correctly preserved when using first_value() and last_value() aggregate functions. This is a bug fix that improves metadata propagation.


🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com

github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) labels Dec 15, 2025
@adriangb adriangb requested a review from alamb December 15, 2025 16:09
adriangb added a commit to pydantic/datafusion that referenced this pull request Dec 15, 2025
}

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
    Ok(arg_types[0].clone())
Contributor

Ideally we should have return_type return an internal error instead of leaving it implemented if we use return_field now
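
One way to read this suggestion, sketched with stand-in types rather than the real DataFusion trait: once `return_field` is the source of truth, `return_type` fails loudly instead of silently returning a type without metadata. `PlanError`, `FirstValueSketch`, and the simplified `Field` below are hypothetical and exist only for illustration.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins; the real trait and error types live in DataFusion.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    data_type: DataType,
    metadata: HashMap<String, String>,
}

#[derive(Debug, PartialEq)]
enum PlanError {
    Internal(String),
}

struct FirstValueSketch;

impl FirstValueSketch {
    // The type-only hook reports an internal error so that any caller still
    // using it is noticed, instead of getting a type with no metadata.
    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType, PlanError> {
        Err(PlanError::Internal(
            "return_type should not be called; use return_field".to_string(),
        ))
    }

    // The field-aware hook passes the first argument's field through unchanged.
    fn return_field(&self, arg_fields: &[Field]) -> Result<Field, PlanError> {
        arg_fields
            .first()
            .cloned()
            .ok_or_else(|| PlanError::Internal("expected at least one argument".to_string()))
    }
}
```

The trade-off in this sketch is that any code path still calling `return_type` becomes a hard error, which is only safe if every caller has moved to the field-aware hook.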

}

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
    Ok(arg_types[0].clone())
Contributor

Ditto

/// UDF to extract metadata from a field for testing purposes
/// Usage: get_metadata(expr, 'key') -> returns the metadata value or NULL
#[derive(Debug, PartialEq, Eq, Hash)]
struct GetMetadataUdf {
Contributor

I wonder if this is useful enough to introduce as a function to datafusion itself, instead of being only in SLT? 🤔

Contributor Author

Happy to do so if you think it's worth it - I went with the conservative approach

Contributor

We don't need to do it in this PR, but it's worth filing a ticket as a follow-up
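
For readers unfamiliar with the helper under discussion, its documented contract (`get_metadata(expr, 'key')` returns the metadata value or NULL) can be sketched as follows. This is not the UDF from the PR: `lookup_metadata` and `metadata_column` are hypothetical helpers, and a plain `HashMap` stands in for the field's metadata.

```rust
use std::collections::HashMap;

// Hypothetical helper, not the UDF from this PR: `field_metadata` stands in
// for the metadata map attached to the first argument's field.
fn lookup_metadata(field_metadata: &HashMap<String, String>, key: &str) -> Option<String> {
    // A missing key maps to `None`, mirroring the documented "value or NULL".
    field_metadata.get(key).cloned()
}

// Metadata belongs to the field, not to individual values, so the looked-up
// value is the same for every row of the batch.
fn metadata_column(
    field_metadata: &HashMap<String, String>,
    key: &str,
    num_rows: usize,
) -> Vec<Option<String>> {
    vec![lookup_metadata(field_metadata, key); num_rows]
}
```

Because the lookup depends only on the field, the result repeats on every row regardless of the row's value, including rows where the value itself is NULL.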

3 1
NULL 1

# Regression test: first_value should preserve metadata
Contributor

I noticed that the existing regression tests in this file don't actually check metadata 🤔

# Regression test: prevent field metadata loss per https://github.com/apache/datafusion/issues/12687
query I
select count(distinct name) from table_with_metadata;
----
2

With this new UDF we can look into updating those tests to be more similar to the ones introduced here

Contributor Author

Will file a ticket before we merge this to do so

Contributor

I think the existing tests check that there is metadata on the input tables (rather than the output tables)

I do really like the idea of adding a UDF, similar to arrow_typeof, that can show the metadata

```sql
> select arrow_typeof('foo');
+---------------------------+
| arrow_typeof(Utf8("foo")) |
+---------------------------+
| Utf8                      |
+---------------------------+
1 row(s) fetched.
Elapsed 0.024 seconds.
```

Possibility: add a new argument

```sql
> select arrow_typeof('foo', true);
```

Possibility: add a new function

```sql
> select arrow_metadata('foo');
```

Contributor

I think this would help debug various metadata issues more easily

I can file a ticket if you think this is reasonable

alamb (Contributor) left a comment

Thank you @adriangb -- I agree with all @Jefffrey's comments too

Let me know if you want a follow-up ticket

}

fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
    // Get the metadata key from the second argument (must be a string literal)
Contributor

I think it would also be nice if we supported a single column version that returned the metadata as a struct array too

----
1 the id field NULL the name field
3 the id field baz the name field
NULL the id field bar the name field
Member

What would happen if a non-existent column is passed as the first argument of the get_metadata() UDF? Or a scalar value?

Contributor Author

Return an empty map?

Member

Or NULL.
I mean it would be good to have some negative test cases too.

Contributor

> What would happen if a non-existing column is passed as a first argument of the get_metadata() udf

I think it should probably error like any other query that tries to access an undefined column
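
The negative cases raised in this thread can be made concrete with a small model. This is only one possible behavior under stated assumptions, not what the merged UDF does: `Arg`, `Schema`, and this `get_metadata` function are hypothetical, an undefined column is treated as an error (as suggested above), and a scalar literal is assumed to carry no metadata and therefore yields NULL (`None`).

```rust
use std::collections::HashMap;

// Hypothetical model of the first argument: a column reference or a literal.
enum Arg<'a> {
    Column(&'a str),
    Literal(&'a str),
}

// Hypothetical schema: column name -> that field's metadata.
type Schema = HashMap<String, HashMap<String, String>>;

// Assumed behavior for illustration only:
// - undefined column  -> error, like any other reference to an unknown column
// - defined column    -> the value for `key`, or `None` if the key is absent
// - scalar literal    -> `None`, since a literal is assumed to have no metadata
fn get_metadata(schema: &Schema, arg: &Arg, key: &str) -> Result<Option<String>, String> {
    match arg {
        Arg::Column(name) => {
            let meta = schema
                .get(*name)
                .ok_or_else(|| format!("column '{name}' not found"))?;
            Ok(meta.get(key).cloned())
        }
        Arg::Literal(_) => Ok(None),
    }
}
```

Each branch corresponds to one negative test worth having: an unknown column, a known column with an unknown key, and a scalar argument.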

@adriangb
Contributor Author

I opened #19356 to track generalizing the function added in this PR.

@adriangb adriangb added this pull request to the merge queue Dec 16, 2025
Merged via the queue into apache:main with commit 50d20dd Dec 16, 2025
27 checks passed
alamb (Contributor) commented Dec 17, 2025

@erratic-pattern is looking at something similar for us upstream in InfluxDB

Linked issue: first_value/last_value aggregate functions don't preserve Field metadata