[Proposal] Decouple logical from physical types #11513

@notfilippo

Abstract

Logical types are an abstract representation of data that emphasises structure without regard for physical implementation, while physical types are tangible representations that describe how data will be stored, accessed, and retrieved.

Currently the type model in DataFusion, in both logical and physical plans, relies purely on Arrow's DataType enum. While this choice makes sense for its physical execution engine (DataTypes map 1:1 with the physical array representation, defining implicit rules for storage, access, and retrieval), it has flaws when dealing with logical plans. This is because some logically equivalent array types in the Arrow format have very different DataTypes – for example, a logical string array can be represented as an Arrow array of DataType Utf8, Dictionary(Utf8), RunEndEncoded(Utf8), or StringView (without mentioning the different index types that can be specified for dictionary and REE arrays).

This proposal explores possible solutions for decoupling the notion of types in DataFusion's logical plans from DataTypes, evaluating their impact on DataFusion itself and on downstream crates.

Goals

  • Introduce a subset of DataTypes to be used as logical types
  • Define the boundaries of logical types and physical types in the planning and execution path
  • Define a solution to keep track of a column’s logical and physical type
  • Introduce a new logical ScalarValue
  • Define how to specify materialisation options for physical types
  • Evaluate the impact on downstream crates

Proposal

Defining a logical type

To define the list of logical types we must first take a look at the physical representation of the engine: the Arrow columnar format. DataTypes are the physical types of the DataFusion engine, and they define the storage and access patterns for buffers in the Arrow format.

Looking at the list of possible DataTypes, it's clear that while some map 1:1 with their logical representation, others also specify information about the encoding (e.g. Large*, FixedSize*, Dictionary, RunEndEncoded...). The latter must be consolidated into what they represent, discarding the encoding information; in general, types that can store different ranges of values should be distinct logical types (ref).

What follows is a list of DataTypes and how they would map to their respective logical types following the rules above:

| DataType (aka physical type) | Logical type | Note |
|---|---|---|
| Null | Null | |
| Boolean | Boolean | |
| Int8 | Int8 | |
| Int16 | Int16 | |
| Int32 | Int32 | |
| Int64 | Int64 | |
| UInt8 | UInt8 | |
| UInt16 | UInt16 | |
| UInt32 | UInt32 | |
| UInt64 | UInt64 | |
| Float16 | Float16 | |
| Float32 | Float32 | |
| Float64 | Float64 | |
| Timestamp(unit, tz) | Timestamp(unit, tz) | |
| Date32 | Date | |
| Date64 | Date | Date64 doesn't actually provide more precision. (docs) |
| Time32(unit) | Time32(unit) | |
| Time64(unit) | Time64(unit) | |
| Duration(unit) | Duration(unit) | |
| Interval(unit) | Interval(unit) | |
| Binary | Binary | |
| FixedSizeBinary(size) | Binary | |
| LargeBinary | Binary | |
| BinaryView | Binary | |
| Utf8 | Utf8 | |
| LargeUtf8 | Utf8 | |
| Utf8View | Utf8 | |
| List(field) | List(field) | |
| ListView(field) | List(field) | |
| FixedSizeList(field, size) | List(field) | |
| LargeList(field) | List(field) | |
| LargeListView(field) | List(field) | |
| Struct(fields) | Struct(fields) | |
| Union(fields, mode) | Union(fields) | |
| Dictionary(index_type, data_type) | underlying data_type, converted to logical type | |
| Decimal128(precision, scale) | Decimal128(precision, scale) | |
| Decimal256(precision, scale) | Decimal256(precision, scale) | |
| Map(fields, sorted) | Map(fields, sorted) | |
| RunEndEncoded(run_ends_type, data_type) | underlying data_type, converted to logical type | |

User defined types

User defined physical types

The Arrow columnar format provides guidelines for defining Extension types through the composition of native DataTypes and custom metadata in fields. Since this proposal already includes a mapping from DataType to logical type, we could extend it to support user defined types (through extension types) which would map to a known logical type.

For example, an extension type with the DataType List(UInt8) and a custom metadata field {'ARROW:extension:name': 'myorg.mystring'} could have a logical type of Utf8.
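
For illustration, a minimal sketch of such a field using arrow-rs (the field name and its mapping to logical Utf8 are the assumptions from the example above):

use std::collections::HashMap;
use std::sync::Arc;
use arrow::datatypes::{DataType, Field};

// A field whose physical layout is List(UInt8), with extension metadata
// marking it as the user defined type "myorg.mystring"; under this proposal
// it would map to the logical type Utf8.
fn mystring_field() -> Field {
    let item = Arc::new(Field::new("item", DataType::UInt8, false));
    Field::new("mystring", DataType::List(item), true).with_metadata(HashMap::from([(
        "ARROW:extension:name".to_string(),
        "myorg.mystring".to_string(),
    )]))
}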

User defined logical types

Arrow extension types can also be used to extend the list of supported logical types. An additional logical type called Extension could be introduced. This extension type would contain a structure detailing its logical type and the extension type metadata.
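
One possible shape for this (names and fields are illustrative, not part of the proposal's draft):

// Hypothetical: details of an Arrow extension type carried by the
// logical type system.
#[derive(Clone, Debug)]
pub struct ExtensionType {
    /// Extension name, as found in 'ARROW:extension:name'.
    name: String,
    /// Serialized contents of 'ARROW:extension:metadata', if any.
    metadata: Option<String>,
}

pub enum LogicalType {
    // ... native variants from the table above ...
    Extension(ExtensionType),
}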

Boundaries of logical and physical types

In plans and expressions

As the prefix suggests, logical types should be used exclusively in logical plans (LogicalPlan and Expr) while physical types should be used exclusively in physical plans (ExecutionPlan and PhysicalExpr). This would enable logical plans to be purely logical, without worrying about underlying encodings.

Expr in logical plans would need to represent their resulting value as logical types: the trait method ExprSchemable::get_type would need to return a logical type instead of a DataType.
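
Sketched against today's ExprSchemable, the change could look roughly like this (assuming the LogicalType enum from above):

use datafusion_common::{ExprSchema, Result};

pub trait ExprSchemable {
    /// Returns the logical type of the expression instead of a DataType.
    fn get_type(&self, schema: &dyn ExprSchema) -> Result<LogicalType>;
    // ... nullable(), metadata(), etc. stay as they are today ...
}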

In functions

ScalarUDF, WindowUDF, and AggregateUDF all define their Signatures through the use of DataTypes. Function arguments are currently validated against signatures through type coercion during logical planning. With logical types, Signatures would be expressed without the need to specify the underlying encoding. This would simplify the type coercion rules, removing the need to traverse dictionaries and handle different containers, focusing instead on explicit logical rules (e.g. all logical types can be coerced to Utf8).
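
For illustration, a hypothetical logical signature for a one-argument string function; a single Utf8 rule would match Utf8, LargeUtf8, Utf8View, Dictionary(_, Utf8), and RunEndEncoded(_, Utf8) arguments alike:

use datafusion_expr::{Signature, TypeSignature, Volatility};

// Assumption: TypeSignature variants take LogicalType instead of DataType.
fn string_signature() -> Signature {
    Signature::new(
        TypeSignature::Exact(vec![LogicalType::Utf8]),
        Volatility::Immutable,
    )
}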

During execution, the function receives a slice of ColumnarValue that is guaranteed to match the signature. Being strictly a physical operation, the function will have to deal with physical types. The ColumnarValue enum could be extended so that functions can provide their own optimised implementation for a subset of physical types and fall back to a generic implementation that materialises the argument to a known physical type. This would potentially allow native functions to support user defined physical types that map to known logical types.
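
A sketch of what such an invocation could look like for a logically-string argument (utf8view_impl, generic_impl, and scalar_impl are hypothetical helpers):

use arrow::array::ArrayRef;
use arrow::compute::cast;
use arrow::datatypes::DataType;
use datafusion_common::{Result, ScalarValue};
use datafusion_expr::ColumnarValue;

// Hypothetical helpers: a specialised kernel and generic fallbacks.
fn utf8view_impl(array: &ArrayRef) -> Result<ColumnarValue> { todo!() }
fn generic_impl(array: &ArrayRef) -> Result<ColumnarValue> { todo!() }
fn scalar_impl(scalar: &ScalarValue) -> Result<ColumnarValue> { todo!() }

fn invoke(args: &[ColumnarValue]) -> Result<ColumnarValue> {
    match &args[0] {
        ColumnarValue::Array(array) => match array.data_type() {
            // Fast path: operate directly on the StringView encoding.
            DataType::Utf8View => utf8view_impl(array),
            // Fallback: materialise to the canonical physical type for the
            // logical type Utf8, then run the generic implementation.
            _ => generic_impl(&cast(array, &DataType::Utf8)?),
        },
        ColumnarValue::Scalar(scalar) => scalar_impl(scalar),
    }
}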

In Substrait

The datafusion_substrait crate provides helper functions to enable interoperability between Substrait plans and DataFusion's plans. While some effort has been made to support converting from / to DataTypes via type_variation_reference (example here), dictionaries are not supported as either literal types or cast types, leading to potential errors when trying to encode a valid LogicalPlan into a Substrait plan. The usage of logical types would enable a more seamless transition between DataFusion's native logical plan and Substrait.

Keeping track of the physical type

While logical types simplify the set of types that can be handled during logical planning, the relation to the underlying physical representation needs to be accounted for when transforming the logical plan into nested ExecutionPlan and PhysicalExpr nodes, which will dictate how the query will execute.

This proposal introduces a new trait that represents the link between a logical type and its underlying physical representation:

use std::sync::Arc;

use arrow::datatypes::DataType;

pub enum LogicalType {/*...*/}

/// Shared handle to a logical / physical type relation.
pub type TypeRelationRef = Arc<dyn TypeRelation + Send + Sync>;

pub trait TypeRelation: std::fmt::Debug {
    /// The type seen by logical plans.
    fn logical(&self) -> &LogicalType;
    /// The Arrow type used for storage, access, and execution.
    fn physical(&self) -> &DataType;
    // ...
}

/// The built-in relation between a DataType and its logical type.
#[derive(Clone, Debug)]
pub struct NativeType {
    logical: LogicalType,
    physical: DataType,
}

impl TypeRelation for NativeType {/*...*/}
impl From<DataType> for NativeType {/*...*/}
impl From<DataType> for LogicalType {/*...*/}

/// The type used throughout plans and schemas, wrapping any TypeRelation.
#[derive(Clone, Debug)]
pub struct LogicalPhysicalType(TypeRelationRef);

impl TypeRelation for LogicalPhysicalType {/*...*/}

While NativeType would be primarily used for standard DataTypes and their logical relation, TypeRelation is defined to provide support for user defined physical types.

What follows is an exploration of the areas in which LogicalPhysicalType would need to get introduced:

A new type of Schema

To support the creation of LogicalPhysicalType, a new schema must be introduced which can be consumed as either a logical schema or used to access the underlying physical representation. Currently DFSchema is used throughout DataFusion as a thin wrapper around Arrow's native Schema in order to qualify fields originating from potentially different tables. This proposal suggests decoupling DFSchema from its underlying Schema and instead adopting a new Schema-compatible structure (LogicalPhysicalSchema) with DataTypes replaced by LogicalPhysicalType. This would also mean introducing a new Field-compatible structure (LogicalPhysicalField) which supports LogicalPhysicalType instead of Arrow's native Field DataType.
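
A minimal sketch of these structures (the field layout is illustrative):

use std::collections::HashMap;

use arrow::datatypes::Schema;

#[derive(Clone, Debug)]
pub struct LogicalPhysicalField {
    name: String,
    /// Replaces Arrow's DataType, carrying the logical type alongside its
    /// physical representation.
    data_type: LogicalPhysicalType,
    nullable: bool,
}

#[derive(Clone, Debug)]
pub struct LogicalPhysicalSchema {
    fields: Vec<LogicalPhysicalField>,
    metadata: HashMap<String, String>,
}

// A plain Arrow Schema converts by pairing each field's DataType with the
// NativeType relation derived from it.
impl From<Schema> for LogicalPhysicalSchema {/*...*/}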

DFSchema would be used by DataFusion's planning and execution engine to derive either logical or physical type information for each field. It should retain the current interoperability with Schema (and additionally the new LogicalPhysicalSchema), allowing easy Into & From conversions.

Type sources

Types in plans are sourced through Arrow's native Schema returned by implementations of TableSource / TableProvider, the variable DataTypes returned by VarProvider, and ScalarValue. To allow the definition of custom LogicalPhysicalTypes, these type sources should be edited to return LogicalPhysicalSchema / LogicalPhysicalType.

Tables

For tables a non-breaking way of editing the trait to support LogicalPhysicalSchema could be:

  1. Declare a new trait method logical_physical_schema() -> LogicalPhysicalSchema. Its default implementation calls schema() and converts the result to a LogicalPhysicalSchema without introducing any custom LogicalPhysicalType. Implementers are free to override this method and add custom LogicalPhysicalTypes (see the sketch after this list).
  2. Declare the existing schema() method to return impl Into<LogicalPhysicalSchema>.
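
A sketch of option 1, assuming the From<Schema> conversion described above exists:

use arrow::datatypes::SchemaRef;

pub trait TableSource {
    fn schema(&self) -> SchemaRef;

    /// Default implementation: derive the logical schema from the existing
    /// physical one, introducing no custom LogicalPhysicalType. Implementers
    /// can override this to attach their own type relations.
    fn logical_physical_schema(&self) -> LogicalPhysicalSchema {
        self.schema().as_ref().clone().into()
    }

    // ... remaining methods unchanged ...
}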

Open question: Should we really introduce a new schema type or should we reuse DFSchema? The qualifiers for fields in DFSchema should not be specified by the Tables themselves.

VarProvider

VarProvider needs to be edited to return a LogicalPhysicalType when getting the type of a variable, while the actual variable value can remain a ScalarValue.
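
Sketched against today's trait, only the return type of get_type would change:

use datafusion_common::{Result, ScalarValue};

pub trait VarProvider: std::fmt::Debug {
    /// The variable's value itself can remain a ScalarValue.
    fn get_value(&self, var_names: Vec<String>) -> Result<ScalarValue>;

    /// Edited: returns the type relation instead of a bare DataType.
    fn get_type(&self, var_names: &[String]) -> Option<LogicalPhysicalType>;
}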

ScalarValue

ScalarValue should be wrapped so that both its logical and its underlying physical type can be retrieved. When reasoning about logical plans it should be treated as its logical type, while its physical properties should be accessed exclusively by the physical plan.

Open question: Should ScalarValue be split into a LogicalScalarValue and a PhysicalScalarValue? Or should it just provide generic information during logical planning (e.g. its logical type, is_null(), is_zero()) without access to the underlying physical representation?
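
For concreteness, a rough sketch of the wrapping approach (LogicalScalarValue here is just the name from the open question above):

use datafusion_common::ScalarValue;

#[derive(Clone, Debug)]
pub struct LogicalScalarValue {
    /// The underlying physical value, accessed exclusively by physical plans.
    value: ScalarValue,
    /// The logical type exposed during logical planning.
    logical_type: LogicalType,
}

impl LogicalScalarValue {
    /// Logical planning only sees logical properties...
    pub fn logical_type(&self) -> &LogicalType {
        &self.logical_type
    }

    pub fn is_null(&self) -> bool {
        self.value.is_null()
    }

    /// ...while the physical plan can still reach the physical value.
    pub fn physical(&self) -> &ScalarValue {
        &self.value
    }
}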

Physical execution

For physical planning and execution, much like the invocation of UDFs, ExecutionPlan and PhysicalExpr must also be granted access to the LogicalPhysicalType in order to perform optimised execution for a subset of supported physical types and fall back to a generic implementation that materialises other types to a known physical type. This can be achieved by substituting the use of DataType and Schema with, respectively, LogicalPhysicalType and LogicalPhysicalSchema.

Impact on downstream dependencies

Care must be taken not to introduce breaking changes for downstream crates and dependencies that build on top of DataFusion.

The most impactful changes introduced by this proposal are the LogicalPhysicalType, LogicalPhysicalSchema, and LogicalPhysicalField types. These structures would replace most mentions of DataType, Schema, and Field in the DataFusion codebase. Type sources (TableProvider / TableSource, VarProvider, and ScalarValue) and LogicalPlan / ExecutionPlan nodes would be greatly affected by this change. The effect can be mitigated by providing good Into & From implementations for the new types and by editing existing function arguments and return types to impl Into<LogicalPhysical*>, but it will still break a lot of things.

Case study: datafusion-comet

datafusion-comet is a high-performance accelerator for Apache Spark, built on top of DataFusion. A fork of the project containing changes from this proposal currently compiles without modifications. As more features in this proposal are implemented, namely the UDFs' logical Signature, some refactoring might be required (e.g. for CometScalarFunction and other functions defined in the codebase). Refer to the draft's TODOs to see what's missing.

Non-goals: topics that might be interesting to dive into

While not the primary goals of this proposal, here are some interesting topics that could be explored in the future:

RecordBatches with same logical type but different physical types

Integrating LogicalPhysicalSchema into DataFusion's RecordBatches, streamed from one ExecutionPlan to the other, could be an interesting approach to support the possibility of two record batches having logically equivalent schemas with different underlying physical types. This could be useful in situations where data, stored in multiple pages mapping 1:1 with RecordBatches, is encoded with different strategies based on the density of the rows and the cardinality of the values.

Current work effort

The [Epic] issue tracking this effort can be found at #12622.
