[Format][Docs] Clarify (remove?) usage of the term "logical types" #41691

@jorisvandenbossche

In several places in the Arrow specification and documentation we use the term "logical types", although we don't use it consistently and we don't actually have physical types (only physical layouts) to contrast it with.

Current usage

The Columnar Format doc page has a section called "Logical Types" (https://arrow.apache.org/docs/15.0/format/Columnar.html#logical-types) that contrasts those types with the physical layouts:

The Schema.fbs defines built-in logical types supported by the Arrow columnar format. Each logical type uses one of the above physical layouts. Nested logical types may have different physical layouts depending on the particular realization of the type.

It explains an Array as having a logical data type, where "Each logical data type has a well-defined physical layout."

The authoritative Schema.fbs also uses the term:

/// Logical types, vector layouts, and schemas

although it also uses the term in a "correct" way (which is nevertheless inconsistent with how we currently define the term):

arrow/format/Schema.fbs

Lines 101 to 105 in 07a30d9

/// Represents the same logical types that List can, but contains offsets and
/// sizes allowing for writes in any order and sharing of child values among
/// list values.
table ListView {
}
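
To make the List/ListView contrast concrete, here is a minimal pure-Python sketch of two layouts that decode to the same list values. It is not Arrow's implementation: the helper names `decode_list` and `decode_list_view` are hypothetical, buffers are modeled as plain Python lists, and validity bitmaps are omitted.

```python
def decode_list(offsets, child):
    """List layout: n + 1 non-decreasing offsets delimit n slices of child."""
    return [child[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


def decode_list_view(offsets, sizes, child):
    """ListView layout: each slot has its own (offset, size) pair."""
    return [child[o:o + s] for o, s in zip(offsets, sizes)]


# List layout: child values must appear in list order.
list_child = [1, 2, 3, 4]
list_offsets = [0, 2, 2, 4]

# ListView layout: child values were written in a different order,
# which the independent offsets and sizes can still describe.
view_child = [3, 4, 1, 2]
view_offsets = [2, 0, 0]
view_sizes = [2, 0, 2]

from_list = decode_list(list_offsets, list_child)
from_view = decode_list_view(view_offsets, view_sizes, view_child)
```

Both `from_list` and `from_view` hold the same values even though the underlying buffers differ, which is the sense in which the comment above speaks of "the same logical types".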

The Python docs (https://arrow.apache.org/docs/15.0/python/data.html#type-metadata):

We use the name logical type because the physical storage may be the same for one or more types. For example, int64, float64, and timestamp[ms] all occupy 64 bits per value.
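
As an illustration of that sentence, the following sketch uses only the Python standard library (not pyarrow) to interpret the same 8 bytes in three ways. The variable names are illustrative assumptions, and the timestamp interpretation assumes milliseconds since the Unix epoch in UTC.

```python
import struct
from datetime import datetime, timedelta, timezone

# One 64-bit value stored as 8 little-endian bytes.
raw = struct.pack("<q", 1_700_000_000_000)

# The same storage read as three different types.
as_int64 = struct.unpack("<q", raw)[0]
as_float64 = struct.unpack("<d", raw)[0]  # same bits, read as an IEEE 754 double
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
as_timestamp_ms = epoch + timedelta(milliseconds=as_int64)
```

The bytes never change; only the type attached to them determines what value an application sees.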

The term is also used in various implementations.

In the Terminology section of the Columnar Format docs (https://arrow.apache.org/docs/15.0/format/Columnar.html#terminology), we define it as:

Logical type: An application-facing semantic value type that is implemented using some physical layout. For example, Decimal values are stored as 16 bytes in a fixed-size binary layout. Similarly, strings can be stored as List<1-byte>. A timestamp may be stored as 64-bit fixed-size layout.

which is mostly consistent with our current usage ("using some physical layout"), but it is confusing that it describes strings as List<1-byte>, since strings actually use a different physical layout.
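
For reference, a rough sketch of the variable-size binary layout that strings use: one contiguous data buffer plus an offsets buffer, rather than a nested list of 1-byte children. This is a simplified pure-Python model, assuming 32-bit offsets and no validity bitmap; the helper name `get_string` is hypothetical.

```python
import struct

values = ["arrow", "", "types"]

# Data buffer: all UTF-8 bytes concatenated.
encoded = [v.encode("utf-8") for v in values]
data = b"".join(encoded)

# Offsets buffer: n + 1 entries; value i spans data[offsets[i]:offsets[i + 1]].
offsets = [0]
for chunk in encoded:
    offsets.append(offsets[-1] + len(chunk))
offsets_buffer = struct.pack(f"<{len(offsets)}i", *offsets)  # int32 offsets


def get_string(i):
    """Read value i back out of the two buffers."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")
```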

Previous discussion

Generally we use the term relatively consistently to contrast "logical types" with the "physical layouts", but confusion around the terminology has come up regularly (what are "physical types" then? And extension types are essentially "logical types" too, but ones that annotate our own logical types). This was specifically discussed in #14752.

@amoeba proposed (#14752 (comment)):

Still some discussion to be had about avoiding "logical" vs. "physical" in favor of "types" and "layouts" and possibly updating the format docs comprehensively
