From 318c3c27e7310aee180f658a04bd93a2f1a8c92e Mon Sep 17 00:00:00 2001
From: David Li
Date: Sun, 28 Jul 2024 19:38:59 -0400
Subject: [PATCH] GH-43453: [Format] Add Opaque canonical extension type
Co-authored-by: Sutou Kouhei
---
docs/source/format/CanonicalExtensions.rst | 110 +++++++++++++++++++++
1 file changed, 110 insertions(+)
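
Note: the serialization this patch specifies for ``arrow.opaque`` (a JSON
object carrying ``type_name`` and ``vendor_name``) can be sketched as below.
This is a minimal illustration using only the Python standard library, not
code from any Arrow implementation. The function names are hypothetical, and
treating both parameters as required strings is an assumption of this sketch.
Unknown fields are ignored, matching the statement that additional fields may
be added in the future.

```python
import json


def serialize_opaque(type_name: str, vendor_name: str) -> str:
    """Build the extension metadata string for an arrow.opaque field."""
    return json.dumps({"type_name": type_name, "vendor_name": vendor_name})


def deserialize_opaque(metadata: str) -> tuple[str, str]:
    """Return (type_name, vendor_name) and ignore any unknown fields."""
    params = json.loads(metadata)
    if not isinstance(params, dict):
        raise ValueError("arrow.opaque metadata must be a JSON object")
    type_name = params.get("type_name")
    vendor_name = params.get("vendor_name")
    # Assumption of this sketch: both parameters must be present as strings.
    if not isinstance(type_name, str) or not isinstance(vendor_name, str):
        raise ValueError("type_name and vendor_name must be strings")
    return type_name, vendor_name


if __name__ == "__main__":
    metadata = serialize_opaque("geometry", "PostGIS")
    print(metadata)
    print(deserialize_opaque(metadata))
```

Because the parser reads only the two known keys, metadata written by a
later version with extra fields still round-trips through this sketch.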
diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst
index c258f889dc6..1d86fcf23c4 100644
--- a/docs/source/format/CanonicalExtensions.rst
+++ b/docs/source/format/CanonicalExtensions.rst
@@ -283,6 +283,116 @@ UUID
A specific UUID version is not required or guaranteed. This extension represents
UUIDs as FixedSizeBinary(16) with big-endian notation and does not interpret the bytes in any way.
+Opaque
+======
+
+Opaque represents a type that an Arrow-based system received from an external
+(often non-Arrow) system, but that it cannot interpret. In this case, it can
+pass on Opaque to its clients to at least show that a field exists and
+preserve metadata about the type from the other system.
+
+Extension parameters:
+
+* Extension name: ``arrow.opaque``.
+
+* The storage type of this extension is any type. If there is no underlying
+ data, the storage type should be Null.
+
+* Extension type parameters:
+
+ * **type_name** = the name of the unknown type in the external system.
+ * **vendor_name** = the name of the external system.
+
+* Description of the serialization:
+
+ A valid JSON object containing the parameters as fields. In the future,
+ additional fields may be added, but no field, current or future, is ever
+ required to interpret the array.
+
+ Developers **should not** attempt to enable public semantic interoperability
+ of Opaque by canonicalizing specific values of these parameters.
+
+Rationale
+---------
+
+Interfacing with non-Arrow systems requires a way to handle data that doesn't
+have an equivalent Arrow type. In this case, use the Opaque type, which
+explicitly represents an unsupported field. Other solutions are inadequate:
+
+* Raising an error means even one unsupported field makes all operations
+ impossible, even if (for instance) the user is just trying to view a schema.
+* Dropping unsupported columns misleads the user as to the actual schema.
+* An extension type may not exist for the unsupported type.
+* Generating an extension type on the fly would falsely imply support.
+
+Applications **should not** make conventions around vendor_name and type_name.
+These parameters are meant for human end users to understand what type wasn't
+supported. Applications may try to interpret these fields, but must be
+prepared for breakage (e.g., when the type becomes supported with a custom
+extension type later on). Similarly, **Opaque is not a generic container for
+file formats**. Considerations such as MIME types are irrelevant. In both of
+these cases, create a custom extension type instead.
+
+Examples:
+
+* A Flight SQL service that supports connecting external databases may
+ encounter columns with unsupported types in external tables. In this case,
+ it can use the Opaque[Null] type to at least report that a column exists
+ with a particular name and type name. This lets clients know that a column
+ exists, but is not supported. Null is used as the storage type here because
+ only schemas are involved.
+
+ An example of the extension metadata would be::
+
+ {"type_name": "varray", "vendor_name": "Oracle"}
+
+* The ADBC PostgreSQL driver gets results as a series of length-prefixed byte
+ fields. But the driver will not always know how to parse the bytes, as
+ there may be extensions (e.g. PostGIS). It can use Opaque[Binary] to still
+ return those bytes to the application, which may be able to parse the data
+ itself. Opaque differentiates the column from an actual binary column and
+ makes it clear that the value is directly from PostgreSQL. (A custom
+ extension type is preferred, but there will always be extensions that the
+ driver does not know about.)
+
+ An example of the extension metadata would be::
+
+ {"type_name": "geometry", "vendor_name": "PostGIS"}
+
+* The ADBC PostgreSQL driver may also know how to parse the bytes, but not
+ know the intended semantics. For example, composite types
+ (described in the PostgreSQL documentation) can add new
+ semantics to existing types, somewhat like Arrow extension types. The
+ driver would be able to parse the underlying bytes in this case, but would
+ still use the Opaque type.
+
+ Consider the example in the PostgreSQL documentation of a ``complex`` type.
+ Mapping the type to a plain Arrow ``struct`` type would lose meaning, just
+ as it would be undesirable for an Arrow system to handle all extension
+ types by dropping the extension metadata. Instead, the driver can use
+ Opaque[Struct] to pass on the composite type info. (It would be wrong to
+ try to map this to an Arrow-defined complex type: the driver does not know
+ the proper semantics of a user-defined type, which cannot and should not be
+ hardcoded into the driver in the first place.)
+
+ An example of the extension metadata would be::
+
+ {"type_name": "database_name.schema_name.complex", "vendor_name": "PostgreSQL"}
+
+* The JDBC adapter in the Arrow Java libraries converts JDBC result sets into
+ Arrow arrays, and can get Arrow schemas from result sets. JDBC, however,
+ allows drivers to return arbitrary Java objects, which have no
+ equivalent Arrow type.
+
+ The driver can use Opaque[Null] as a placeholder during schema conversion,
+ only erroring if the application tries to fetch the actual data. That way,
+ clients can at least introspect result schemas to decide whether they can
+ proceed to fetch the data, or only query certain columns.
+
+ An example of the extension metadata would be::
+
+ {"type_name": "OTHER", "vendor_name": "JDBC driver name"}
+
=========================
Community Extension Types
=========================