docs: add array type support #5884

Merged
yanghua merged 5 commits into lance-format:main from XuQianJin-Stars:docs/add-array-type-support
Feb 5, 2026

@XuQianJin-Stars
Contributor

Summary

Add comprehensive documentation for Arrow data type support in Lance, with a focus on array types for vector embeddings, the most important use case for Lance integration.

This PR addresses community feedback that the Lance documentation did not cover array/vector type support. See: https://fluss.apache.org/docs/next/streaming-lakehouse/integrate-data-lakes/lance/

Changes

  • Add new docs/src/guide/data_types.md covering:

    • Complete Arrow type system overview (primitive, string, binary types)
    • Array types for vector embeddings (FixedSizeList, List, LargeList)
    • Python and Rust code examples for creating and using vector embeddings
    • Vector search examples with index creation
    • Nested and complex types (Struct, Map)
    • Data type mapping table for integrations with Flink, Spark, Presto, etc.
    • Best practices for vector data storage and retrieval
  • Update docs/src/guide/.pages to include the new Data Types page in navigation

Why This Matters

Vector embeddings are the primary use case for Lance integration with streaming systems like Apache Fluss. The existing documentation lacked:

  1. Clear explanation of how to use ARRAY<FLOAT>(n) / FixedSizeList for vectors
  2. Type mapping table showing array type support
  3. Practical examples for ML/AI workflows

This documentation fills that gap and helps users understand how to properly store and query vector embeddings in Lance.
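
As a rough illustration of why a fixed dimension matters, here is a minimal sketch in plain Python (standard library only; it does not use the Lance or PyArrow APIs). It assumes the general idea behind a fixed-size list layout: every row has the same length, so all vectors can sit in one flat buffer and row `i` is just a slice, with no per-row offsets. The names `DIM`, `pack_vectors`, and `get_vector` are hypothetical and exist only for this example.

```python
from array import array

DIM = 4  # hypothetical embedding dimension; real embeddings are usually much larger


def pack_vectors(vectors, dim=DIM):
    """Pack equal-length vectors into one flat float32 buffer."""
    flat = array("f")
    for vec in vectors:
        if len(vec) != dim:
            # A fixed-size layout rejects rows of the wrong length up front.
            raise ValueError(f"expected {dim} values, got {len(vec)}")
        flat.extend(vec)
    return flat


def get_vector(flat, i, dim=DIM):
    """Row i starts at i * dim, so no offsets buffer is needed."""
    return list(flat[i * dim:(i + 1) * dim])


flat = pack_vectors([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]])
print(len(flat))            # 8 values: 2 rows of dimension 4
print(get_vector(flat, 1))  # [5.0, 6.0, 7.0, 8.0]
```

A variable-length list would need an extra offsets buffer to find each row, which is the overhead a fixed dimension avoids.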

@github-actions github-actions Bot added the `documentation` label (Improvements or additions to documentation) Feb 4, 2026
@github-actions
Contributor

github-actions Bot commented Feb 4, 2026

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@prrao87 prrao87 changed the title from `docs:add array type support` to `docs: add array type support` Feb 5, 2026
@prrao87 prrao87 requested a review from Copilot February 5, 2026 01:50
Contributor

Copilot AI left a comment

Pull request overview

This PR adds comprehensive documentation for Arrow data types support in Lance, with a strong emphasis on array types for vector embeddings. The documentation addresses a gap identified by the community regarding array/vector type support, particularly for integrations with systems like Apache Fluss.

Changes:

  • Added new docs/src/guide/data_types.md covering the full Arrow type system, array types for vectors, and data type mapping tables for integrations
  • Updated docs/src/guide/.pages to include the new Data Types page in the documentation navigation
  • Provided Python and Rust code examples for vector embeddings, vector search, and complex types

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| docs/src/guide/data_types.md | New comprehensive guide covering Arrow type system, FixedSizeList for vector embeddings, variable-length arrays, nested types, integration type mappings, and best practices for vector data |
| docs/src/guide/.pages | Added "Data Types" entry to navigation menu between "Read and Write" and "Data Evolution" |

Comment thread docs/src/guide/data_types.md Outdated
| `UInt8`, `UInt16`, `UInt32`, `UInt64` | Unsigned integers | IDs, indices |
| `Float16`, `Float32`, `Float64` | Floating point numbers | Measurements, scores |
| `Decimal128`, `Decimal256` | Fixed-precision decimals | Financial data |
| `Date32`, `Date64` | Date values | Timestamps |
Copilot AI Feb 5, 2026

The "Example Use Case" column lists "Timestamps" for Date types, which is misleading. Date types (Date32, Date64) store only dates without time information, while the Timestamp type (listed separately on line 20) stores both date and time. Consider changing the example use case for Date types to something like "Birth dates, event dates" to clearly differentiate from timestamps.

Suggested change
| `Date32`, `Date64` | Date values | Timestamps |
| `Date32`, `Date64` | Date values | Birth dates, event dates |
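
The distinction behind this suggestion can be sketched in plain Python (standard library only, not PyArrow). It relies on how Arrow defines the two encodings: Date32 is a count of days since the UNIX epoch, while a Timestamp is a count of time units (microseconds here) since the epoch. The helper names `to_date32` and `to_timestamp_us` are hypothetical.

```python
from datetime import date, datetime, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_DT = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_date32(d):
    """Days since the UNIX epoch, the value a Date32 column stores."""
    return (d - EPOCH_DATE).days


def to_timestamp_us(dt):
    """Microseconds since the UNIX epoch, like a Timestamp with microsecond unit."""
    delta = dt - EPOCH_DT
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


morning = datetime(2026, 2, 5, 1, 50, tzinfo=timezone.utc)
evening = datetime(2026, 2, 5, 22, 15, tzinfo=timezone.utc)

# Both moments collapse to the same date value: the time of day is lost.
print(to_date32(morning.date()) == to_date32(evening.date()))  # True
# The timestamp encoding keeps them distinct.
print(to_timestamp_us(morning) == to_timestamp_us(evening))    # False
```

This is why "Timestamps" is a misleading use case for the date types: a date column cannot represent the time component at all.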

Comment thread docs/src/guide/data_types.md Outdated
| `DOUBLE` | `Float64` | |
| `DECIMAL(p,s)` | `Decimal128(p,s)` | |
| `STRING` / `VARCHAR` | `Utf8` | |
| `CHAR(n)` | `Utf8` | Fixed-width string |
Copilot AI Feb 5, 2026

The mapping from CHAR(n) to Utf8 with the note "Fixed-width string" may be misleading. Arrow's Utf8 type is a variable-length string type, not fixed-width. When SQL CHAR(n) types (which are padded to a fixed width) are converted to Arrow/Lance, they become variable-length Utf8 strings. Consider clarifying the note to say "Fixed-width in source system" or "Converted from fixed-width string" to avoid confusion about Arrow's representation.

Suggested change
| `CHAR(n)` | `Utf8` | Fixed-width string |
| `CHAR(n)` | `Utf8` | Fixed-width in source system; stored as variable-length Utf8 |
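
To show what "variable-length" means here, the following is a minimal sketch in plain Python (standard library only, not PyArrow) of the offsets-plus-data layout that Arrow uses for Utf8 columns. The helpers `encode_utf8_column` and `char_n` are hypothetical, and whether a given connector keeps or trims CHAR(n) padding is an assumption that this sketch does not settle; it shows both cases.

```python
def encode_utf8_column(values):
    """Build an offsets list and one data buffer, mirroring a variable-length string layout."""
    offsets = [0]
    data = bytearray()
    for value in values:
        data.extend(value.encode("utf-8"))
        offsets.append(len(data))  # end of this value = start of the next
    return offsets, bytes(data)


def char_n(value, n):
    """Pad to n characters, as a SQL CHAR(n) column would (hypothetical helper)."""
    return value.ljust(n)


padded = [char_n("ab", 5), char_n("abcd", 5)]
trimmed = [v.rstrip() for v in padded]

# With padding kept, every value happens to be 5 bytes, but the layout still uses offsets.
print(encode_utf8_column(padded))   # ([0, 5, 10], b'ab   abcd ')
# With padding trimmed, the values have different lengths in the same layout.
print(encode_utf8_column(trimmed))  # ([0, 2, 6], b'ababcd')
```

Either way, the fixed width is a property of the source column, not of the stored representation.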

Collaborator

@yanghua yanghua left a comment

Left two comments, otherwise LGTM.


`FixedSizeList` is the recommended type for storing fixed-dimensional vector embeddings. Each vector has the same number of dimensions, making it highly efficient for storage and computation.

=== "Python"
Collaborator

Is it necessary? What would the === be rendered to?

print(f"Created dataset with {ds.count_rows()} rows")
```

=== "Rust"
Collaborator

ditto

Contributor Author

ditto

The same tab syntax is already used in duckdb.md:
=== "SQL"

```sql
INSTALL lance FROM community;
LOAD lance;
```

=== "Python"

```python
import duckdb

duckdb.sql(
    """
    INSTALL lance FROM community;
    LOAD lance;
    """
)
```

Collaborator

Got it.

Collaborator

@yanghua yanghua left a comment

LGTM. thanks.

@yanghua yanghua merged commit a286e4b into lance-format:main Feb 5, 2026
6 checks passed