docs: add array type support #5884
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
Pull request overview
This PR adds comprehensive documentation for Arrow data types support in Lance, with a strong emphasis on array types for vector embeddings. The documentation addresses a gap identified by the community regarding array/vector type support, particularly for integrations with systems like Apache Fluss.
Changes:
- Added new `docs/src/guide/data_types.md` covering the full Arrow type system, array types for vectors, and data type mapping tables for integrations
- Updated `docs/src/guide/.pages` to include the new Data Types page in the documentation navigation
- Provided Python and Rust code examples for vector embeddings, vector search, and complex types
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| docs/src/guide/data_types.md | New comprehensive guide covering Arrow type system, FixedSizeList for vector embeddings, variable-length arrays, nested types, integration type mappings, and best practices for vector data |
| docs/src/guide/.pages | Added "Data Types" entry to navigation menu between "Read and Write" and "Data Evolution" |
| | `UInt8`, `UInt16`, `UInt32`, `UInt64` | Unsigned integers | IDs, indices | | ||
| | `Float16`, `Float32`, `Float64` | Floating point numbers | Measurements, scores | | ||
| | `Decimal128`, `Decimal256` | Fixed-precision decimals | Financial data | | ||
| | `Date32`, `Date64` | Date values | Timestamps | |
The "Example Use Case" column lists "Timestamps" for Date types, which is misleading. Date types (Date32, Date64) store only dates without time information, while the Timestamp type (listed separately on line 20) stores both date and time. Consider changing the example use case for Date types to something like "Birth dates, event dates" to clearly differentiate from timestamps.
Suggested change:
- | `Date32`, `Date64` | Date values | Timestamps |
+ | `Date32`, `Date64` | Date values | Birth dates, event dates |
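To make the distinction concrete, here is a minimal stdlib-only Python sketch. It assumes Arrow's documented physical encodings (`Date32` as days since the Unix epoch, a timestamp as a count of time units since the epoch, microseconds here). The helper names `to_date32`, `from_date32`, and `to_timestamp_us` are hypothetical and are not part of Lance or pyarrow.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_TS = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_date32(d: date) -> int:
    """Encode a calendar date as days since the Unix epoch (Date32-style)."""
    return (d - EPOCH_DATE).days


def from_date32(days: int) -> date:
    """Decode a Date32-style day count back into a calendar date."""
    return EPOCH_DATE + timedelta(days=days)


def to_timestamp_us(ts: datetime) -> int:
    """Encode a timezone-aware datetime as microseconds since the Unix epoch."""
    return (ts - EPOCH_TS) // timedelta(microseconds=1)


event = datetime(2024, 1, 1, 13, 45, 30, tzinfo=timezone.utc)
as_date = to_date32(event.date())  # the time of day is dropped here
as_ts = to_timestamp_us(event)     # the time of day is preserved here

print(from_date32(as_date))
# Microseconds past midnight: the part a date column cannot represent.
print(as_ts - as_date * 86_400 * 1_000_000)
```

The second printed value is exactly the information a date column would lose, which is why "Timestamps" is a misleading use case for `Date32`/`Date64`.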
| | `DOUBLE` | `Float64` | | | ||
| | `DECIMAL(p,s)` | `Decimal128(p,s)` | | | ||
| | `STRING` / `VARCHAR` | `Utf8` | | | ||
| | `CHAR(n)` | `Utf8` | Fixed-width string | |
The mapping from CHAR(n) to Utf8 with the note "Fixed-width string" may be misleading. Arrow's Utf8 type is a variable-length string type, not fixed-width. When SQL CHAR(n) types (which are padded to a fixed width) are converted to Arrow/Lance, they become variable-length Utf8 strings. Consider clarifying the note to say "Fixed-width in source system" or "Converted from fixed-width string" to avoid confusion about Arrow's representation.
Suggested change:
- | `CHAR(n)` | `Utf8` | Fixed-width string |
+ | `CHAR(n)` | `Utf8` | Fixed-width in source system; stored as variable-length Utf8 |
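As an illustration of this point, the following stdlib-only Python sketch mimics Arrow's variable-length string layout: one data buffer plus an offsets buffer with one more entry than there are values. The helper `utf8_layout` is hypothetical, and whether a given connector preserves or strips the `CHAR(n)` padding is an assumption here, not something this PR states.

```python
def utf8_layout(values):
    """Return (offsets, data) in the style of Arrow's Utf8 layout."""
    offsets = [0]
    data = bytearray()
    for value in values:
        data += value.encode("utf-8")
        offsets.append(len(data))  # end position of this value in the data buffer
    return offsets, bytes(data)


# CHAR(5) values as a SQL engine would typically hand them over: space-padded.
char5 = ["ab".ljust(5), "abcde", "x".ljust(5)]

padded_offsets, _ = utf8_layout(char5)
trimmed_offsets, _ = utf8_layout([v.rstrip(" ") for v in char5])

print(padded_offsets)   # equal-sized steps, but only because of the padding
print(trimmed_offsets)  # step sizes vary once the padding is stripped
```

In both cases the width lives in the offsets rather than in the type, so nothing in the `Utf8` column itself enforces the source system's `n`.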
yanghua left a comment
Left two comments, otherwise LGTM.
| `FixedSizeList` is the recommended type for storing fixed-dimensional vector embeddings. Each vector has the same number of dimensions, making it highly efficient for storage and computation. | ||
| === "Python" |
Is it necessary? What would the === be rendered to?
| print(f"Created dataset with {ds.count_rows()} rows") | ||
| ``` | ||
| === "Rust" |

Summary
Add comprehensive documentation for Arrow data types support in Lance, with a focus on array types for vector embeddings - the most important use case for Lance integration.
This PR addresses the feedback from the community that the Lance documentation was missing array/vector type support documentation. See: https://fluss.apache.org/docs/next/streaming-lakehouse/integrate-data-lakes/lance/
Changes
- Add new `docs/src/guide/data_types.md` covering array types (`FixedSizeList`, `List`, `LargeList`)
- Update `docs/src/guide/.pages` to include the new Data Types page in navigation

Why This Matters
Vector embeddings are the primary use case for Lance integration with streaming systems like Apache Fluss. The existing documentation lacked:
- `ARRAY<FLOAT>(n)` / `FixedSizeList` for vectors

This documentation fills that gap and helps users understand how to properly store and query vector embeddings in Lance.
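
A rough stdlib-only Python sketch of why a fixed dimension matters for vectors: when every vector has the same length, the values can live in one flat buffer and row `i` is located by arithmetic alone, which is how Arrow lays out `FixedSizeList`. The class `FixedSizeVectors` and its methods are hypothetical illustrations, not the Lance or pyarrow API.

```python
from array import array


class FixedSizeVectors:
    """Flat float32 buffer holding vectors that all share one dimension."""

    def __init__(self, dim: int):
        if dim <= 0:
            raise ValueError("dim must be positive")
        self.dim = dim
        self.values = array("f")  # single contiguous buffer of float32 values

    def append(self, vector):
        # Reject vectors of the wrong length, as a fixed-size type would.
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} values, got {len(vector)}")
        self.values.extend(vector)

    def __len__(self):
        return len(self.values) // self.dim

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        start = i * self.dim  # no offsets buffer needed: position is computed
        return self.values[start:start + self.dim].tolist()


vectors = FixedSizeVectors(dim=4)
vectors.append([0.0, 0.5, 1.0, 1.5])
vectors.append([2.0, 2.5, 3.0, 3.5])

print(len(vectors), vectors[1])
```

In pyarrow, the corresponding type for this example would be built with `pa.list_(pa.float32(), 4)`. A variable-length `List` has to store an offsets buffer in addition to the values, as its rows can differ in length.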