docs: add schema data types and field IDs documentation #5925

Merged
wjones127 merged 5 commits into lance-format:main from dik654:docs/schema-update
Feb 17, 2026

@dik654
Contributor

@dik654 dik654 commented Feb 10, 2026

This PR documents the supported data types in Lance schemas and field ID semantics.

Changes

  • Add comprehensive data types reference with Arrow type mappings
  • Document field ID assignment and properties
  • Include examples of field ID usage in nested structures

How to use

Users can now refer to the schema documentation for understanding data type representations and field ID behavior.

@github-actions github-actions bot added the "documentation" label (Improvements or additions to documentation) Feb 10, 2026
@dik654 dik654 force-pushed the docs/schema-update branch from ae6b557 to c7ca59d Compare February 10, 2026 04:37
@wjones127 wjones127 self-assigned this Feb 10, 2026
Contributor

@wjones127 wjones127 left a comment

Got a few minor suggestions, but otherwise this looks pretty good. Thank you for working on this!

Comment thread docs/src/format/table/schema.md Outdated
Field IDs can be used in several contexts:

1. **Data File References**: Specify which columns are present in each data file
2. **Deletion Tracking**: Reference specific columns when applying deletions

Could you explain more what this means? I'm not sure what it is talking about.

- **Stable**: IDs are preserved across schema evolution operations
- **Sparse**: Field IDs may not form a contiguous sequence after schema evolution

### Using Field IDs

Maybe we should just replace this section with the sentence "When referencing fields internally within the format, use the field ids rather than field names or positions."
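
A minimal sketch of why internal references would key on field IDs, as the suggested sentence says. The `Field` dataclass and the `rename`, `drop`, and `by_id` helpers are hypothetical illustrations, not part of Lance's API, and the ID values are made up; the sketch only models the "Stable" and "Sparse" properties quoted in this thread.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Field:
    id: int    # stable identifier; none of the helpers below change it
    name: str


def rename(schema, field_id, new_name):
    """Return a new schema with one field renamed; its ID is untouched."""
    return [replace(f, name=new_name) if f.id == field_id else f for f in schema]


def drop(schema, field_id):
    """Return a new schema without the given field; other IDs keep their values."""
    return [f for f in schema if f.id != field_id]


def by_id(schema, field_id):
    """Look a field up by ID, independent of its name or position."""
    return next(f for f in schema if f.id == field_id)


schema = [Field(0, "id"), Field(1, "text"), Field(2, "vector")]
schema = rename(schema, 2, "embedding")   # name changes, ID 2 stays
schema = list(reversed(schema))           # reorder: positions change, IDs stay
schema = drop(schema, 1)                  # remaining IDs are 0 and 2: a gap appears

print(by_id(schema, 2).name)              # embedding
print(sorted(f.id for f in schema))       # [0, 2]
```

A reference stored as "field 2" still resolves after the rename and the reorder, whereas a reference stored as the name `vector` or as position 2 would not.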

Comment thread docs/src/format/table/schema.md Outdated
- **Drop Column**: Remove field from schema; its ID may be reused in some systems
- **Rename Column**: Change field name; ID remains the same
- **Reorder Columns**: Change field order in schema; IDs remain the same
- **Type Evolution**: Subject to compatibility rules defined by Apache Arrow

Suggested change
- **Type Evolution**: Subject to compatibility rules defined by Apache Arrow
- **Type Evolution**: Data type can be changed. This might require rewriting the column in the data, depending on how the type was changed.

dik654 and others added 5 commits February 17, 2026 16:27
Add comprehensive documentation of Lance schema format including:
- Complete reference of all supported data types and their string representations
- Mapping between logical types and Apache Arrow types
- Field ID assignment and evolution semantics
- Field metadata configuration options
- Schema examples for common use cases

This resolves the gap identified in issue lance-format#5707 by providing detailed specification
of what data types are supported and how they map to Arrow types.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

Complete the schema.md documentation with:
- Primary Key Metadata section that links to index.md
- All Arrow types properly formatted with backticks for consistency
- Note section referencing discussions lance-format#5864 and lance-format#5817 on logical type simplification
- Comprehensive coverage of data types, field IDs, metadata, and examples

This resolves lance-format#5707 by providing a complete specification of the schema format
including supported data types (with Arrow type mappings), field ID system,
field metadata configuration, and practical examples.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

…tation

Cross-check against actual code revealed discrepancies in protobuf Field definition:

1. Corrected Field message documentation:
   - Separated `type` (enum: PARENT/REPEATED/LEAF) from `logical_type` (string)
   - Added missing `parent_id` field for nested field relationships
   - Added `unenforced_primary_key` and `unenforced_primary_key_position` fields
   - Corrected metadata type from map<string, string> to map<string, bytes>

2. Enhanced nested field explanation:
   - Clarified how parent_id links child fields to parent
   - Updated field ID assignment example to show parent_id relationships
   - Added note about parent_id=0 for top-level fields

3. Updated Primary Key Metadata section:
   - Changed from metadata reference to direct protobuf field documentation
   - Documented both unenforced_primary_key and position fields

4. Improved examples:
   - Updated all example schemas to use logical_type
   - Changed Primary Key example to use protobuf fields instead of metadata
   - Updated nested structure example to show parent_id relationships
   - Added clarifying note about simplified representation vs actual protobuf format

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

Critical fix discovered during comprehensive code validation:

Corrected parent_id value for top-level fields from 0 to -1:
- Top-level fields (no parent) have parent_id: -1
- Nested fields have parent_id: <parent_field_id>

This matches the actual implementation in:
- lance-file/src/datatypes.rs: if f.parent_id == -1 { ... }
- Field deserialization logic uses -1 to detect top-level fields

Updated:
1. Field ID assignment examples (all top-level: parent_id: -1)
2. All example schemas (Simple Table, Nested Structure, Vector Embeddings)
3. Protobuf Field definition documentation
4. Field ID Assignment section explanation
5. Note about children vector in Rust in-memory representation

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
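
The `parent_id` convention described in the commit above can be sketched as follows. This is an illustration under stated assumptions, not Lance's implementation: the `Field` dataclass, the `build_tree` helper, and the example field names and IDs are all hypothetical. Only the rule itself (top-level fields carry `parent_id: -1`, nested fields carry their parent's field ID) comes from the commit message.

```python
from dataclasses import dataclass, field

TOP_LEVEL = -1  # parent_id of a field with no parent, per the commit message


@dataclass
class Field:
    id: int
    name: str
    parent_id: int = TOP_LEVEL
    children: list = field(default_factory=list)  # filled in by build_tree


def build_tree(flat_fields):
    """Attach each field to its parent and return the top-level fields."""
    by_id = {f.id: f for f in flat_fields}
    roots = []
    for f in flat_fields:
        if f.parent_id == TOP_LEVEL:
            roots.append(f)
        else:
            by_id[f.parent_id].children.append(f)
    return roots


# Hypothetical flat field list: "location" is a struct with two children.
flat = [
    Field(0, "id"),
    Field(1, "location"),
    Field(2, "lat", parent_id=1),
    Field(3, "lon", parent_id=1),
]
roots = build_tree(flat)
print([f.name for f in roots])               # ['id', 'location']
print([c.name for c in roots[1].children])   # ['lat', 'lon']
```

Using -1 rather than 0 as the sentinel matters if 0 can be a real field ID, as it is in this hypothetical example: with 0 as the sentinel, children of field 0 would be indistinguishable from top-level fields.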
@dik654 dik654 force-pushed the docs/schema-update branch from c7ca59d to bc6dbd1 Compare February 17, 2026 07:28
Contributor

@wjones127 wjones127 left a comment

This looks good now. Will merge once CI is finished. Thanks for working on this! 😄

@wjones127 wjones127 merged commit 5b8744d into lance-format:main Feb 17, 2026
5 checks passed