preserve Field metadata in first_value/last_value #19335
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -32,7 +32,7 @@ use arrow::record_batch::RecordBatch; | |
| use datafusion::catalog::{ | ||
| CatalogProvider, MemoryCatalogProvider, MemorySchemaProvider, Session, | ||
| }; | ||
| use datafusion::common::{not_impl_err, DataFusionError, Result}; | ||
| use datafusion::common::{exec_err, not_impl_err, DataFusionError, Result, ScalarValue}; | ||
| use datafusion::functions::math::abs; | ||
| use datafusion::logical_expr::async_udf::{AsyncScalarUDF, AsyncScalarUDFImpl}; | ||
| use datafusion::logical_expr::{ | ||
|
|
@@ -398,6 +398,58 @@ pub async fn register_metadata_tables(ctx: &SessionContext) { | |
| .unwrap(); | ||
|
|
||
| ctx.register_batch("table_with_metadata", batch).unwrap(); | ||
|
|
||
| // Register the get_metadata UDF for testing metadata preservation | ||
| ctx.register_udf(ScalarUDF::from(GetMetadataUdf::new())); | ||
| } | ||
|
|
||
| /// UDF to extract metadata from a field for testing purposes | ||
| /// Usage: get_metadata(expr, 'key') -> returns the metadata value or NULL | ||
| #[derive(Debug, PartialEq, Eq, Hash)] | ||
| struct GetMetadataUdf { | ||
| signature: Signature, | ||
| } | ||
|
|
||
| impl GetMetadataUdf { | ||
| fn new() -> Self { | ||
| Self { | ||
| signature: Signature::any(2, Volatility::Immutable), | ||
| } | ||
| } | ||
| } | ||
|
|
||
| impl ScalarUDFImpl for GetMetadataUdf { | ||
| fn as_any(&self) -> &dyn Any { | ||
| self | ||
| } | ||
|
|
||
| fn name(&self) -> &str { | ||
| "get_metadata" | ||
| } | ||
|
|
||
| fn signature(&self) -> &Signature { | ||
| &self.signature | ||
| } | ||
|
|
||
| fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> { | ||
| Ok(DataType::Utf8) | ||
| } | ||
|
|
||
| fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> { | ||
| // Get the metadata key from the second argument (must be a string literal) | ||
|
Contributor
I think it would also be nice if we supported a single-column version that returned the metadata as a struct array too.
||
| let key = match &args.args[1] { | ||
| ColumnarValue::Scalar(ScalarValue::Utf8(Some(k))) => k.clone(), | ||
| _ => { | ||
| return exec_err!("get_metadata second argument must be a string literal") | ||
| } | ||
| }; | ||
|
|
||
| // Get metadata from the first argument's field | ||
| let metadata_value = args.arg_fields[0].metadata().get(&key).cloned(); | ||
|
|
||
| // Return as a scalar (same value for all rows) | ||
| Ok(ColumnarValue::Scalar(ScalarValue::Utf8(metadata_value))) | ||
| } | ||
| } | ||
|
|
||
| /// Create a UDF function named "example". See the `sample_udf.rs` example | ||
|
|
||
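To make the reviewer's suggestion above concrete, here is a minimal sketch of the two lookup shapes: the two-argument form used by `get_metadata` in this PR and a one-argument form that returns every entry. It uses only the standard library; a plain `HashMap<String, String>` stands in for the map returned by Arrow's `Field::metadata()`, and the names `lookup_metadata` and `all_metadata` are hypothetical, not part of DataFusion. Building an actual struct array is out of scope here.

```rust
use std::collections::HashMap;

/// Two-argument form: look up a single key.
/// `None` stands in for SQL NULL (assumption: mirrors the test UDF's behaviour).
fn lookup_metadata(metadata: &HashMap<String, String>, key: &str) -> Option<String> {
    metadata.get(key).cloned()
}

/// Hypothetical one-argument form: return every entry.
/// Entries are sorted by key because HashMap iteration order is unspecified.
fn all_metadata(metadata: &HashMap<String, String>) -> Vec<(String, String)> {
    let mut entries: Vec<(String, String)> = metadata
        .iter()
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect();
    entries.sort();
    entries
}
```

Sorting the entries is a design choice for the sketch: it keeps expected output stable in tests such as `.slt` files, where row or field order matters.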
| Original file line number | Diff line number | Diff line change | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -235,7 +235,56 @@ order by 1 asc nulls last; | |||||||||||
| 3 1 | ||||||||||||
| NULL 1 | ||||||||||||
|
|
||||||||||||
| # Regression test: first_value should preserve metadata | ||||||||||||
|
Contributor
I noticed that the existing regression tests in this file don't actually check metadata 🤔 (see datafusion/datafusion/sqllogictest/test_files/metadata.slt, lines 63 to 67 in 58377bf).
With this new UDF we can look into updating those tests to be more similar to the ones introduced here.
Contributor
Author
Will file a ticket to do so before we merge this.
Contributor
I think the tests check that there is metadata on the input tables (rather than the output tables). I do really like the idea of adding a UDF.
Possibility: add a new function, e.g. `select arrow_metadata('foo');`
Contributor
I think this would help debug various metadata issues more easily. I can file a ticket if you think this is reasonable.
||||||||||||
| query IT | ||||||||||||
| select first_value(id order by id asc nulls last), get_metadata(first_value(id order by id asc nulls last), 'metadata_key') | ||||||||||||
| from table_with_metadata; | ||||||||||||
| ---- | ||||||||||||
| 1 the id field | ||||||||||||
|
|
||||||||||||
| # Regression test: last_value should preserve metadata | ||||||||||||
| query IT | ||||||||||||
| select last_value(id order by id asc nulls first), get_metadata(last_value(id order by id asc nulls first), 'metadata_key') | ||||||||||||
| from table_with_metadata; | ||||||||||||
| ---- | ||||||||||||
| 3 the id field | ||||||||||||
|
|
||||||||||||
| # Regression test: DISTINCT ON should preserve metadata (uses first_value internally) | ||||||||||||
| query ITTT | ||||||||||||
| select distinct on (id) id, get_metadata(id, 'metadata_key'), name, get_metadata(name, 'metadata_key') | ||||||||||||
| from table_with_metadata order by id asc nulls last; | ||||||||||||
| ---- | ||||||||||||
| 1 the id field NULL the name field | ||||||||||||
| 3 the id field baz the name field | ||||||||||||
| NULL the id field bar the name field | ||||||||||||
|
|
||||||||||||
| # Regression test: DISTINCT should preserve metadata | ||||||||||||
| query ITTT | ||||||||||||
| with res AS ( | ||||||||||||
| select distinct id, name from table_with_metadata | ||||||||||||
| ) | ||||||||||||
| select id, get_metadata(id, 'metadata_key'), name, get_metadata(name, 'metadata_key') | ||||||||||||
| from res | ||||||||||||
| order by id asc nulls last; | ||||||||||||
| ---- | ||||||||||||
| 1 the id field NULL the name field | ||||||||||||
| 3 the id field baz the name field | ||||||||||||
| NULL the id field bar the name field | ||||||||||||
|
|
||||||||||||
| # Regression test: grouped columns should preserve metadata | ||||||||||||
| query ITTT | ||||||||||||
| with res AS ( | ||||||||||||
| select name, count(*), id | ||||||||||||
| from table_with_metadata | ||||||||||||
| group by id, name | ||||||||||||
| ) | ||||||||||||
| select id, get_metadata(id, 'metadata_key'), name, get_metadata(name, 'metadata_key') | ||||||||||||
| from res | ||||||||||||
| order by id asc nulls last, name asc nulls last | ||||||||||||
| ---- | ||||||||||||
| 1 the id field NULL the name field | ||||||||||||
| 3 the id field baz the name field | ||||||||||||
| NULL the id field bar the name field | ||||||||||||
|
Member
What would happen if a non-existent column is passed as the first argument of the get_metadata() UDF? Or a scalar value?
Contributor
Author
Return an empty map?
Member
Or NULL.
Contributor
I think it should probably error like any other query that tries to access an undefined column.
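The options raised in this thread can be sketched side by side. The following is a standard-library-only model, not DataFusion code: `Schema`, `Arg`, and this `get_metadata` function are hypothetical names, and treating a literal as having no metadata is an assumption. It returns an error for an undefined column, and `None` (standing in for NULL) for a missing key or a literal argument.

```rust
use std::collections::HashMap;

/// Stand-in for a table schema: column name -> that field's metadata.
type Schema = HashMap<String, HashMap<String, String>>;

/// The first argument is either a column reference or a literal value.
enum Arg<'a> {
    Column(&'a str),
    Literal,
}

/// Err      -> the column does not exist (like any other undefined-column query)
/// Ok(None) -> no such key, or the argument is a literal (assumed to carry no metadata)
fn get_metadata(schema: &Schema, arg: Arg<'_>, key: &str) -> Result<Option<String>, String> {
    match arg {
        Arg::Literal => Ok(None),
        Arg::Column(name) => {
            let metadata = schema
                .get(name)
                .ok_or_else(|| format!("column '{}' not found", name))?;
            Ok(metadata.get(key).cloned())
        }
    }
}
```

In a real query engine the undefined-column error would typically be raised during planning, before the UDF is ever invoked; the sketch folds both steps into one function only to show the three outcomes together.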
||||||||||||
|
|
||||||||||||
| statement ok | ||||||||||||
| drop table table_with_metadata; | ||||||||||||
I wonder if this is useful enough to introduce as a function to datafusion itself, instead of being only in SLT? 🤔
Happy to do so if you think it's worth it - I went with the conservative approach
We don't need to do it in this PR, but worth filing a ticket for as followup