[draft] Add LogicalType, try to support user-defined types #8143

yukkit wants to merge 1 commit into apache:main

Conversation
The current PR has some unresolved issues that require discussion and collaboration.
I've organized the logic for the mutual conversion between DFSchema and Schema:

- DFSchema to Schema: no need to change DefaultPhysicalPlanner
- Schema to DFSchema: to be changed
Thanks @yukkit -- I plan to give this a look, but probably will not have time until tomorrow
```rust
async fn create_type(&self, cmd: CreateType) -> Result<DataFrame> {
```
This is the core processing logic for creating a SQL UDT.
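To make the flow concrete, here is a minimal standalone sketch of what such a handler could do -- the `CreateType`, `TypeSignature`, and `Session` shapes below are simplified stand-ins invented for illustration, not the PR's actual definitions:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the PR's TypeSignature.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct TypeSignature {
    name: String,
    params: Vec<String>,
}

// Hypothetical logical CREATE TYPE statement.
#[derive(Debug)]
struct CreateType {
    name: String,
    params: Vec<String>,
}

#[derive(Default)]
struct Session {
    // Maps a type signature to a registered extension-type payload
    // (a plain String here; the PR uses an ExtensionType trait object).
    registered_types: HashMap<TypeSignature, String>,
}

impl Session {
    /// Sketch of CREATE TYPE handling: build a signature from the
    /// statement, reject duplicates, and register the new type.
    fn create_type(&mut self, cmd: CreateType) -> Result<(), String> {
        let signature = TypeSignature {
            name: cmd.name.clone(),
            params: cmd.params.clone(),
        };
        if self.registered_types.contains_key(&signature) {
            return Err(format!("type {} already exists", cmd.name));
        }
        self.registered_types.insert(signature, cmd.name);
        Ok(())
    }
}

fn main() {
    let mut session = Session::default();
    let cmd = CreateType { name: "geometry".into(), params: vec![] };
    session.create_type(cmd).unwrap();
}
```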
```rust
self.register_data_type(
    type_signature.to_owned_type_signature(),
    extension_type,
)?;
```
Register the extension types created through SQL UDT.
```rust
impl TypeManager for SessionState {
```
SessionState implements TypeManager for registering and retrieving extended data types.
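The trait might look roughly like the following -- a hypothetical shape written from the comment above, with simplified stand-ins for `TypeSignature` and `ExtensionType`:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical stand-ins for the PR's types.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct TypeSignature(String);

trait ExtensionType: std::fmt::Debug + Send + Sync {}

/// A registry for extended data types, roughly what a TypeManager
/// implemented by SessionState could expose.
trait TypeManager {
    fn register_data_type(&mut self, signature: TypeSignature, data_type: Arc<dyn ExtensionType>);
    fn data_type(&self, signature: &TypeSignature) -> Option<Arc<dyn ExtensionType>>;
}

#[derive(Default)]
struct SessionState {
    types: HashMap<TypeSignature, Arc<dyn ExtensionType>>,
}

impl TypeManager for SessionState {
    fn register_data_type(&mut self, signature: TypeSignature, data_type: Arc<dyn ExtensionType>) {
        self.types.insert(signature, data_type);
    }

    fn data_type(&self, signature: &TypeSignature) -> Option<Arc<dyn ExtensionType>> {
        self.types.get(signature).cloned()
    }
}

#[derive(Debug)]
struct GeometryType;
impl ExtensionType for GeometryType {}

fn main() {
    let mut state = SessionState::default();
    state.register_data_type(TypeSignature("geometry".into()), Arc::new(GeometryType));
    assert!(state.data_type(&TypeSignature("geometry".into())).is_some());
}
```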
```diff
 pub trait ExprSchemable {
     /// given a schema, return the type of the expr
-    fn get_type<S: ExprSchema>(&self, schema: &S) -> Result<DataType>;
+    fn get_type<S: ExprSchema>(&self, schema: &S) -> Result<LogicalType>;
```
Replace all arrow DataType uses in the logical planning stage with LogicalType (literal ScalarValue TBD).
```rust
// TODO not convert to DataType
let data_types = data_types
    .into_iter()
    .map(|e| e.physical_type())
    .collect::<Vec<_>>();

Ok((fun.return_type)(&data_types)?.as_ref().clone().into())
```
TODO: function signatures for udf/udaf should use TypeSignature instead of the arrow DataType.
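The idea, as I understand it, is roughly the following -- a hypothetical sketch where the enum shape and the `accepts` helper are invented for illustration:

```rust
// Hypothetical sketch of a UDF signature expressed in terms of logical
// type signatures rather than physical arrow DataTypes.
#[derive(Clone, Debug, PartialEq, Eq)]
enum LogicalTypeSignature {
    /// A built-in logical type, identified by name (e.g. "Int64").
    Builtin(String),
    /// A user-defined type, identified by its registered name.
    Extension(String),
}

struct ScalarUdfSignature {
    /// One entry per argument the function accepts.
    arg_types: Vec<LogicalTypeSignature>,
}

impl ScalarUdfSignature {
    /// Check an argument list against the declared signature.
    fn accepts(&self, args: &[LogicalTypeSignature]) -> bool {
        self.arg_types.len() == args.len()
            && self.arg_types.iter().zip(args).all(|(a, b)| a == b)
    }
}

fn main() {
    // A UDF that only accepts the user-defined `LineString` type.
    let sig = ScalarUdfSignature {
        arg_types: vec![LogicalTypeSignature::Extension("LineString".into())],
    };
    assert!(sig.accepts(&[LogicalTypeSignature::Extension("LineString".into())]));
    assert!(!sig.accepts(&[LogicalTypeSignature::Builtin("Utf8".into())]));
}
```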
```rust
// FIXME support logical data types, not convert to DataType
let data_type = comparison_coercion(
```
Modify the function comparison_coercion to accept LogicalType as parameters.
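For illustration, a toy version of what a logical-type-aware comparison_coercion could look like -- the LogicalType variants and rules below are invented for the sketch, not DataFusion's actual coercion table:

```rust
// Toy logical type, standing in for the PR's LogicalType.
#[derive(Clone, Debug, PartialEq, Eq)]
enum LogicalType {
    Int32,
    Int64,
    Utf8,
    Extension(String),
}

/// Sketch: find a common logical type both sides of a comparison can
/// be coerced to, without dropping down to physical arrow types.
fn comparison_coercion(lhs: &LogicalType, rhs: &LogicalType) -> Option<LogicalType> {
    use LogicalType::*;
    match (lhs, rhs) {
        _ if lhs == rhs => Some(lhs.clone()),
        // Widen mixed integer comparisons to the larger width.
        (Int32, Int64) | (Int64, Int32) => Some(Int64),
        // A real implementation would consult an extension type's own
        // coercion rules; this toy one simply refuses.
        (Extension(_), _) | (_, Extension(_)) => None,
        _ => None,
    }
}

fn main() {
    assert_eq!(
        comparison_coercion(&LogicalType::Int32, &LogicalType::Int64),
        Some(LogicalType::Int64)
    );
    assert_eq!(comparison_coercion(&LogicalType::Utf8, &LogicalType::Int64), None);
}
```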
```diff
 /// data types here come from function 'equal_rows', if more data types are supported
 /// in equal_rows(hash join), add those data types here to generate join logical plan.
-pub fn can_hash(data_type: &DataType) -> bool {
+pub fn can_hash(data_type: &LogicalType) -> bool {
```
The can_hash function might be abstracted as a property of logical data types.
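A toy sketch of that idea -- hashability declared on the logical type itself (invented shapes, not the PR's code):

```rust
// Toy logical type standing in for the PR's LogicalType.
#[derive(Clone, Debug)]
enum LogicalType {
    Int64,
    Utf8,
    Extension { name: String, hashable: bool },
}

impl LogicalType {
    /// Sketch: each logical type reports whether it can be hashed, so
    /// an extension type can declare this itself instead of the planner
    /// pattern-matching on physical arrow types.
    fn can_hash(&self) -> bool {
        match self {
            LogicalType::Int64 | LogicalType::Utf8 => true,
            LogicalType::Extension { hashable, .. } => *hashable,
        }
    }
}

fn main() {
    assert!(LogicalType::Utf8.can_hash());
    let udt = LogicalType::Extension {
        name: "LineString".into(),
        hashable: false,
    };
    assert!(!udt.can_hash());
    println!("{udt:?}");
}
```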
```rust
// FIXME like_coercion use LogicalType
let left_type = expr.get_type(&self.schema)?.physical_type();
let right_type = pattern.get_type(&self.schema)?.physical_type();
let coerced_type = like_coercion(&left_type, &right_type).ok_or_else(|| {
```
TODO: Modify the function like_coercion to accept LogicalType as parameters.
```diff
-pub(crate) fn convert_data_type(&self, sql_type: &SQLDataType) -> Result<DataType> {
+pub(crate) fn convert_data_type(
```
Convert data types in the AST to logical data types.
```rust
SQLDataType::Custom(name, params) => {
    let name = object_name_to_string(name);
    let params = params.iter().map(Into::into).collect();
    let type_signature = TypeSignature::new_with_params(name, params);
    if let Some(data_type) = self.context_provider.get_data_type(&type_signature) {
        return Ok(data_type);
    }

    plan_err!("User-Defined SQL type {sql_type:?} not found")
}
```
Convert custom data types.
alamb left a comment
This is an epic PR @yukkit -- I looked through it and I really liked the structure and the overall code. Really nice 👏
I think the next steps for this PR are to circulate it more widely -- perhaps via a note to the arrow mailing list, and discord / slack channels (I can help with this) and potentially have people gauge how much disruption this will cause with downstream crates (I can test with IOx perhaps, and see what that experience is like)
Again, thanks for pushing this forward
```diff
-        _ => dt1 == dt2,
-    }
+fn datatype_is_logically_equal(dt1: &LogicalType, dt2: &LogicalType) -> bool {
+    dt1 == dt2
```
```rust
data_type: LogicalType,
nullable: bool,
/// A map of key-value pairs containing additional custom meta data.
metadata: HashMap<String, String>,
```
Maybe we could store these in a BTreeMap to avoid sorting them each time 🤔
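A small sketch of that suggestion -- a toy DFField with the arrow-specific parts elided:

```rust
use std::collections::BTreeMap;

/// Toy DFField: the BTreeMap keeps metadata keys in sorted order, so
/// comparing or hashing two fields never needs a per-call sort.
struct DFField {
    name: String,
    // data_type: LogicalType (elided in this sketch)
    nullable: bool,
    metadata: BTreeMap<String, String>,
}

impl DFField {
    fn metadata_pairs(&self) -> impl Iterator<Item = (&str, &str)> {
        // Already sorted by key.
        self.metadata.iter().map(|(k, v)| (k.as_str(), v.as_str()))
    }
}

fn main() {
    let field = DFField {
        name: "geom".into(),
        nullable: true,
        metadata: BTreeMap::from([
            ("b".to_string(), "2".to_string()),
            ("a".to_string(), "1".to_string()),
        ]),
    };
    // Keys come out in sorted order without any explicit sort.
    let keys: Vec<_> = field.metadata_pairs().map(|(k, _)| k).collect();
    assert_eq!(keys, ["a", "b"]);
    println!("{} nullable={}", field.name, field.nullable);
}
```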
```rust
use arrow_schema::{DataType, Field, IntervalUnit, TimeUnit};

#[derive(Clone, Debug)]
pub enum LogicalType {
```
Eventually we should add doc comments here, but it also makes sense to avoid overdoing it on an RFC / draft.
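For a sense of what those doc comments might eventually look like, a hypothetical fragment -- the variant set and wording are illustrative only:

```rust
/// Hypothetical documented sketch; not the PR's actual variant list.
#[derive(Clone, Debug)]
pub enum LogicalType {
    /// A 64-bit signed integer.
    Int64,
    /// A UTF-8 encoded string. At the logical level there is no
    /// Utf8 vs. LargeUtf8 distinction; that is a physical encoding detail.
    Utf8,
    /// A user-defined extension type, resolved through the registered
    /// type signature.
    Extension(String),
}

fn main() {
    println!("{:?}", LogicalType::Extension("geometry".to_string()));
}
```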
```rust
            .collect::<Vec<_>>();
        DataType::Struct(fields.into())
    }
    other => panic!("not implemented {other:?}"),
```
Is the idea that DataType::Dictionary and DataType::RunEndEncoded would also be included here? If so, it makes sense.
Thanks for the heads up. Yes, DataType::Dictionary and DataType::RunEndEncoded should be included here.
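That mapping could look roughly like this -- a standalone sketch with a toy two-variant LogicalType, using the real arrow_schema::DataType so the Dictionary and RunEndEncoded shapes are accurate:

```rust
use arrow_schema::DataType;

// Toy logical type for the sketch.
#[derive(Debug)]
enum LogicalType {
    Int64,
    Utf8,
}

/// Sketch: Dictionary and RunEndEncoded are physical encodings with no
/// logical meaning, so they unwrap to the logical type of their values.
fn to_logical(data_type: &DataType) -> LogicalType {
    match data_type {
        DataType::Int64 => LogicalType::Int64,
        DataType::Utf8 | DataType::LargeUtf8 => LogicalType::Utf8,
        // A dictionary-encoded column is logically just its value type.
        DataType::Dictionary(_key_type, value_type) => to_logical(value_type),
        // Likewise for run-end encoding: use the values field's type.
        DataType::RunEndEncoded(_run_ends, values) => to_logical(values.data_type()),
        other => unimplemented!("not implemented {other:?}"),
    }
}

fn main() {
    let dict = DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8));
    println!("{:?}", to_logical(&dict));
}
```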
```rust
pub format: Arc<dyn FileFormat>,
/// The expected partition column names in the folder structure.
/// See [Self::with_table_partition_cols] for details
/// TODO this maybe LogicalType
```
I think one use case is to use dictionary encoding for the partition columns to minimize the overhead of creating such columns. As long as they can be represented / specified with LogicalType, I think it is a good idea to try changing this too.
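Sketched out, the lowering for a partition column might go like this (toy LogicalType; the UInt16 key width is an arbitrary choice for the example):

```rust
use arrow_schema::DataType;

// Toy logical type: a partition column is logically just a string.
#[derive(Debug)]
enum LogicalType {
    Utf8,
}

/// Sketch: the dictionary encoding used to materialize a partition
/// column cheaply is chosen only when lowering to a physical type --
/// one string in the dictionary, small integer keys for every row.
fn physical_type_for_partition_column(lt: &LogicalType) -> DataType {
    match lt {
        LogicalType::Utf8 => DataType::Dictionary(
            Box::new(DataType::UInt16),
            Box::new(DataType::Utf8),
        ),
    }
}

fn main() {
    println!("{:?}", physical_type_for_partition_column(&LogicalType::Utf8));
}
```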
What's the status of this PR? This should be a very useful feature.
I think this PR is stalled and I don't have any update
Please accept my apologies for the delay. Due to personal circumstances, I have been unable to attend to any work. I will now resume work on this PR.
No worries at all -- I hope all is well and we look forward to this work
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
Hello, sorry if this is a redundant question. What is the status of this PR?

I think it is stale and on track to be closed from what I can see
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
FYI #11160 tracks a new proposal for this feature. It seems to be gaining traction.
Which issue does this PR close?
Closes #7923.
The current pull request is an experimental demo for validating the feasibility of logical types.
Rationale for this change
What changes are included in this PR?
Features
- `UDT` as a function signature in `udf`/`udaf`.
- `register_data_type` function in the `SessionContext`.

New Additions
- `LogicalType` struct.
- `ExtensionType` trait. Abstraction for extension types.
- `TypeSignature` struct. Uniquely identifies a data type.

Major Changes
- `get_data_type(&self, _name: &TypeSignature) -> Option<LogicalType>` function added to the `ContextProvider` trait.
- `DFSchema`/`DFField` now use `LogicalType`, removing the arrow `Field` and retaining only `data_type`, `nullable`, and `metadata`, since `dict_id` and `dict_is_ordered` are not necessary at the logical stage.
- `ExprSchemable` and `ExprSchema` now use `LogicalType`.
- `ast`-to-logical-plan conversion now uses `LogicalType`.

To Be Implemented
- `TypeCoercionRewriter` in the analyze stage uses logical types -- for example, functions like `comparison_coercion`, `get_input_types`, `get_valid_types`, etc.
- `udf`/`udaf` use `TypeSignature` instead of the existing `DataType`, for ease of use in `udf`/`udaf`.

To Be Determined
- Should `ScalarValue` use `LogicalType` or the arrow `DataType`?
  - Option 1: `LogicalType`.
  - Option 2: `DataType`.
- Should `TableSource` return `DFSchema` or the arrow `Schema`?
  - Option 1: `Schema`.
  - Option 2: `DFSchema` (convert `DFSchema` to `Schema`; logical plans use `DFSchema`, physical plans use `Schema`).
- How should `Schema` and `DFSchema` be converted?
  - `Schema` to `DFSchema`?
    - At the `TableScan` node, obtain the arrow `Schema` through `TableSource`/`TableProvider` and then convert it to `DFSchema`.
    - Or: `TableSource`/`TableProvider` returns `DFSchema` instead of `Schema`.
  - `DFSchema` to `Schema`? (a sketch of this direction follows the list)
    - If the physical planner takes the `Schema` from the `TableSource` directly, no conversion is needed.
    - Or: convert the `DFSchema` returned by the `TableSource` to `Schema` at the physical planner stage.
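To make that last option concrete, a minimal standalone sketch of lowering a logical schema to an arrow Schema in the physical planner -- the `DFSchema`, `DFField`, and `LogicalType` shapes here are toy stand-ins, not the PR's definitions:

```rust
use arrow_schema::{DataType, Field, Schema};

// Toy logical types and schema for the sketch.
#[derive(Clone, Debug)]
enum LogicalType {
    Int64,
    Utf8,
}

struct DFField {
    name: String,
    data_type: LogicalType,
    nullable: bool,
}

struct DFSchema {
    fields: Vec<DFField>,
}

/// Sketch: convert the DFSchema returned by the TableSource into a
/// physical arrow Schema at the physical planner stage.
fn to_physical_schema(df_schema: &DFSchema) -> Schema {
    let fields: Vec<Field> = df_schema
        .fields
        .iter()
        .map(|f| {
            // Pick a physical encoding for each logical type.
            let physical = match f.data_type {
                LogicalType::Int64 => DataType::Int64,
                LogicalType::Utf8 => DataType::Utf8,
            };
            Field::new(f.name.clone(), physical, f.nullable)
        })
        .collect();
    Schema::new(fields)
}

fn main() {
    let df_schema = DFSchema {
        fields: vec![DFField {
            name: "id".to_string(),
            data_type: LogicalType::Int64,
            nullable: false,
        }],
    };
    println!("{:?}", to_physical_schema(&df_schema));
}
```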
Some Thoughts
The question of how to convert a `dyn Array` to a `LineStringArray` or `MultiPointArray` was raised. In my perspective, assuming there is a function specifically designed for handling `LineString` data, the function signature can be defined as `LineString`, ensuring that the input data must be of a type acceptable to `LineStringArray`.

Are these changes tested?
Are there any user-facing changes?