27 changes: 26 additions & 1 deletion docs/src/format/table/index.md
@@ -25,7 +25,7 @@ a monotonically increasing version number, and an optional reference to the inde

## Schema & Fields

The schema of the table is written as a series of fields, plus a schema metadata map.
The schema of the table is written as a series of fields, plus a schema metadata map.
The data types generally have a 1-1 correspondence with the Apache Arrow data types.
Each field, including nested fields, has a unique integer id. At initial table creation time, fields are assigned ids in depth-first order.
Afterwards, field IDs are assigned incrementally for newly added fields.
@@ -42,6 +42,31 @@ See [File Format Encoding Specification](../file/encoding.md) for details on ava

</details>

### Unenforced Primary Key
thanks for adding this! I missed it in the doc refresh


Lance supports defining an unenforced primary key through field metadata.
This is useful for deduplication during merge-insert operations and other use cases that benefit from logical row identity.
The primary key is "unenforced", meaning that Lance does not always validate uniqueness constraints.
Users can rely on specific operations such as merge-insert to enforce uniqueness if necessary.
Once set, the primary key is fixed and must not be updated or removed.

A primary key field must satisfy:

- The field, and all its ancestors, must not be nullable.
- The field must be a leaf field (primitive data type without children).
- The field must not be within a list or map type.
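
The three rules above can be sketched as a small check over the path from the schema root to the candidate field. This is a minimal illustration only: `FieldNode`, its `kind` labels, and `primary_key_violations` are hypothetical names invented for this example and are not part of Lance's API.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical, simplified field model used only for this illustration.
@dataclass
class FieldNode:
    name: str
    kind: str  # illustrative labels such as "int64", "struct", "list", "map"
    nullable: bool = False
    children: List["FieldNode"] = field(default_factory=list)


def primary_key_violations(path: List[FieldNode]) -> List[str]:
    """Check a root-to-field path against the three rules listed above.

    `path[-1]` is the candidate field; earlier entries are its ancestors.
    """
    candidate, ancestors = path[-1], path[:-1]
    problems = []
    # Rule 1: neither the field nor any ancestor may be nullable.
    if any(node.nullable for node in path):
        problems.append("field or ancestor is nullable")
    # Rule 2: the field must be a leaf (no children).
    if candidate.children:
        problems.append("field is not a leaf")
    # Rule 3: no ancestor may be a list or map.
    if any(node.kind in ("list", "map") for node in ancestors):
        problems.append("field is inside a list or map")
    return problems


# A non-nullable leaf inside a non-nullable struct satisfies all rules.
user_id = FieldNode("id", "int64")
user = FieldNode("user", "struct", children=[user_id])
ok = primary_key_violations([user, user_id])

# An item of a nullable list violates rules 1 and 3.
tag = FieldNode("item", "string")
tags = FieldNode("tags", "list", nullable=True, children=[tag])
bad = primary_key_violations([tags, tag])
```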

When using an Arrow schema to create a Lance table, add the following metadata to the Arrow field to mark it as part of the primary key:

- `lance-schema:unenforced-primary-key`: Set to `true`, `1`, or `yes` (case-insensitive) to indicate the field is part of the primary key.
- `lance-schema:unenforced-primary-key:position` (optional): A 1-based integer specifying the position within a composite primary key.
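
As a rough sketch of the writer side, the metadata for a primary key field could be assembled as a plain string-to-string map. The helper name `primary_key_metadata` is an assumption made for this example; only the two metadata keys come from the text above. With PyArrow, such a map would typically be passed as the `metadata=` argument of `pa.field(...)` together with `nullable=False`.

```python
from typing import Dict, Optional

PK_KEY = "lance-schema:unenforced-primary-key"
PK_POSITION_KEY = "lance-schema:unenforced-primary-key:position"


def primary_key_metadata(position: Optional[int] = None) -> Dict[str, str]:
    """Build field metadata marking a field as part of the primary key.

    Hypothetical helper for illustration; `position` is 1-based and optional.
    """
    metadata = {PK_KEY: "true"}
    if position is not None:
        if position < 1:
            raise ValueError("position is 1-based and must be >= 1")
        metadata[PK_POSITION_KEY] = str(position)
    return metadata


# Composite key (tenant_id, user_id): tenant_id first, user_id second.
tenant_meta = primary_key_metadata(position=1)
user_meta = primary_key_metadata(position=2)

# Single-column key without an explicit position.
plain_meta = primary_key_metadata()
```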

For composite primary keys with multiple columns, the position determines the primary key field ordering:

- When positions are specified, fields are ordered by their position values (1, 2, 3, ...).
- When positions are not specified, fields are ordered by their schema field id.
- Fields with explicit positions are ordered before fields without.
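
The ordering rules above amount to a two-level sort key. The sketch below is one way to express them, assuming each primary key field is represented as a hypothetical `(name, field_id, position)` tuple where `position` is `None` when no explicit position was given; how ties between equal positions are resolved is not stated in the text and is not modeled here.

```python
from typing import List, Optional, Tuple

# (name, schema field id, explicit 1-based position or None).
# Hypothetical stand-in for a primary key field, for illustration only.
PkField = Tuple[str, int, Optional[int]]


def order_primary_key(fields: List[PkField]) -> List[str]:
    """Return primary key field names in key order."""

    def sort_key(pk_field: PkField) -> Tuple[int, int]:
        _, field_id, position = pk_field
        if position is not None:
            # Explicitly positioned fields come first, ordered by position.
            return (0, position)
        # Remaining fields follow, ordered by schema field id.
        return (1, field_id)

    return [name for name, _, _ in sorted(fields, key=sort_key)]


ordered = order_primary_key(
    [
        ("region", 0, None),
        ("user_id", 3, 2),
        ("tenant_id", 5, 1),
        ("day", 1, None),
    ]
)
# ordered == ["tenant_id", "user_id", "region", "day"]
```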

## Fragments

![Fragment Structure](../../images/fragment_structure.png)
6 changes: 4 additions & 2 deletions java/lance-jni/src/schema.rs
@@ -44,7 +44,8 @@ pub fn convert_to_java_field<'local>(
+ "ZLorg/apache/arrow/vector/types/pojo/ArrowType;"
+ "Lorg/apache/arrow/vector/types/pojo/DictionaryEncoding;"
+ "Ljava/util/Map;"
+ "Ljava/util/List;Z)V";
+ "Ljava/util/List;ZI)V";
let pk_position = lance_field.unenforced_primary_key_position.unwrap_or(0) as jint;
let field_obj = env.new_object(
"org/lance/schema/LanceField",
ctor_sig.as_str(),
@@ -57,7 +58,8 @@ pub fn convert_to_java_field<'local>(
JValue::Object(&JObject::null()),
JValue::Object(&metadata),
JValue::Object(&children),
JValue::Bool(lance_field.unenforced_primary_key as jboolean),
JValue::Bool(lance_field.is_unenforced_primary_key() as jboolean),
JValue::Int(pk_position),
],
)?;

19 changes: 18 additions & 1 deletion java/src/main/java/org/lance/schema/LanceField.java
@@ -22,6 +22,7 @@
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.OptionalInt;
import java.util.stream.Collectors;

public class LanceField {
@@ -34,6 +35,7 @@ public class LanceField {
private final Map<String, String> metadata;
private final List<LanceField> children;
private final boolean isUnenforcedPrimaryKey;
private final int unenforcedPrimaryKeyPosition;

LanceField(
int id,
@@ -44,7 +46,8 @@ public class LanceField {
DictionaryEncoding dictionaryEncoding,
Map<String, String> metadata,
List<LanceField> children,
boolean isUnenforcedPrimaryKey) {
boolean isUnenforcedPrimaryKey,
int unenforcedPrimaryKeyPosition) {
this.id = id;
this.parentId = parentId;
this.name = name;
@@ -54,6 +57,7 @@ public class LanceField {
this.metadata = metadata;
this.children = children;
this.isUnenforcedPrimaryKey = isUnenforcedPrimaryKey;
this.unenforcedPrimaryKeyPosition = unenforcedPrimaryKeyPosition;
}

public int getId() {
@@ -92,6 +96,18 @@ public boolean isUnenforcedPrimaryKey() {
return isUnenforcedPrimaryKey;
}

/**
* Get the position of this field within a composite primary key.
*
* @return the 1-based position if explicitly set, or empty if using schema field id ordering
*/
public OptionalInt getUnenforcedPrimaryKeyPosition() {
if (unenforcedPrimaryKeyPosition > 0) {
return OptionalInt.of(unenforcedPrimaryKeyPosition);
}
return OptionalInt.empty();
}

public Field asArrowField() {
List<Field> arrowChildren =
children.stream().map(LanceField::asArrowField).collect(Collectors.toList());
@@ -110,6 +126,7 @@ public String toString() {
.add("dictionaryEncoding", dictionaryEncoding)
.add("children", children)
.add("isUnenforcedPrimaryKey", isUnenforcedPrimaryKey)
.add("unenforcedPrimaryKeyPosition", unenforcedPrimaryKeyPosition)
.add("metadata", metadata)
.toString();
}
5 changes: 5 additions & 0 deletions protos/file.proto
@@ -166,6 +166,11 @@ message Field {

bool unenforced_primary_key = 12;

// Position of this field in the primary key (1-based).
// 0 means the field is part of the primary key but uses schema field id for ordering.
// When set to a positive value, primary key fields are ordered by this position.
uint32 unenforced_primary_key_position = 13;

// DEPRECATED ----------------------------------------------------------------

// Deprecated: Only used in V1 file format. V2 uses variable encodings defined
4 changes: 3 additions & 1 deletion python/python/lance/lance/schema.pyi
@@ -1,14 +1,16 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The Lance Authors

from typing import Any, Dict, List
from typing import Any, Dict, List, Optional

import pyarrow as pa

class LanceField:
def name(self) -> str: ...
def id(self) -> int: ...
def children(self) -> List[LanceField]: ...
def is_unenforced_primary_key(self) -> bool: ...
def unenforced_primary_key_position(self) -> Optional[int]: ...

class LanceSchema:
def fields(self) -> List[LanceField]: ...
15 changes: 15 additions & 0 deletions python/src/schema.rs
@@ -57,6 +57,21 @@ impl LanceField {
Ok(self.0.metadata.clone())
}

/// Check if this field is part of an unenforced primary key.
pub fn is_unenforced_primary_key(&self) -> bool {
self.0.is_unenforced_primary_key()
}

/// Get the position of this field within a composite primary key.
///
/// Returns the 1-based position if explicitly set, or None if not part of
/// a primary key or using schema field id ordering.
pub fn unenforced_primary_key_position(&self) -> Option<u32> {
self.0
.unenforced_primary_key_position
.filter(|&pos| pos > 0)
}

pub fn to_arrow(&self) -> PyArrowType<arrow_schema::Field> {
PyArrowType((&self.0).into())
}
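
Both the Java getter `getUnenforcedPrimaryKeyPosition` and this Python binding report only positive positions, so the stored value `0` (primary key member ordered by schema field id) surfaces as "no position". The sketch below restates that mapping in Python; `exposed_position` is a hypothetical name for illustration, with `None` standing in for Rust's `None` and Java's `OptionalInt.empty()`.

```python
from typing import Optional


def exposed_position(stored: Optional[int]) -> Optional[int]:
    """Mirror `.filter(|&pos| pos > 0)`: only positive positions are reported."""
    if stored is not None and stored > 0:
        return stored
    return None


# Not part of the primary key, and part of it without an explicit position,
# both map to None here.
not_in_key = exposed_position(None)
id_ordered = exposed_position(0)
second_column = exposed_position(2)
```

Because the first two cases look identical through this accessor, a caller that needs to tell them apart would also consult `is_unenforced_primary_key()`.
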
40 changes: 31 additions & 9 deletions rust/lance-core/src/datatypes/field.rs
@@ -42,6 +42,13 @@ use crate::{
/// (3) The field must not be within a list type.
pub const LANCE_UNENFORCED_PRIMARY_KEY: &str = "lance-schema:unenforced-primary-key";

/// Use this config key in Arrow field metadata to specify the position of a primary key column.
/// The value is a 1-based integer indicating the order within the composite primary key.
/// When specified, primary key fields are ordered by this position value.
/// When not specified, primary key fields are ordered by their schema field id.
pub const LANCE_UNENFORCED_PRIMARY_KEY_POSITION: &str =
"lance-schema:unenforced-primary-key:position";

fn has_blob_v2_extension(field: &ArrowField) -> bool {
field
.metadata()
@@ -148,7 +155,11 @@ pub struct Field {

/// Dictionary value array if this field is dictionary.
pub dictionary: Option<Dictionary>,
pub unenforced_primary_key: bool,

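/// Position of this field in the primary key (1-based).
/// None means the field is not part of the primary key.
/// Some(0) means the field is part of the primary key and is ordered by schema field id.
/// Some(n) with n > 0 means this field is the nth column in the primary key.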
pub unenforced_primary_key_position: Option<u32>,
}

impl Field {
@@ -574,7 +585,7 @@ impl Field {
nullable: self.nullable,
children: vec![],
dictionary: self.dictionary.clone(),
unenforced_primary_key: self.unenforced_primary_key,
unenforced_primary_key_position: self.unenforced_primary_key_position,
};
if path_components.is_empty() {
// Project stops here, copy all the remaining children.
@@ -845,7 +856,7 @@ impl Field {
nullable: self.nullable,
children,
dictionary: self.dictionary.clone(),
unenforced_primary_key: self.unenforced_primary_key,
unenforced_primary_key_position: self.unenforced_primary_key_position,
};
return Ok(f);
}
@@ -908,7 +919,7 @@ impl Field {
nullable: self.nullable,
children,
dictionary: self.dictionary.clone(),
unenforced_primary_key: self.unenforced_primary_key,
unenforced_primary_key_position: self.unenforced_primary_key_position,
})
}
}
@@ -1038,6 +1049,11 @@ impl Field {
pub fn is_leaf(&self) -> bool {
self.children.is_empty()
}

/// Return true if the field is part of the (unenforced) primary key.
pub fn is_unenforced_primary_key(&self) -> bool {
self.unenforced_primary_key_position.is_some()
}
}

impl fmt::Display for Field {
@@ -1114,10 +1130,16 @@ impl TryFrom<&ArrowField> for Field {
}
_ => vec![],
};
let unenforced_primary_key = metadata
.get(LANCE_UNENFORCED_PRIMARY_KEY)
.map(|s| matches!(s.to_lowercase().as_str(), "true" | "1" | "yes"))
.unwrap_or(false);
let unenforced_primary_key_position = metadata
.get(LANCE_UNENFORCED_PRIMARY_KEY_POSITION)
.and_then(|s| s.parse::<u32>().ok())
.or_else(|| {
// Backward compatibility: use 0 for legacy boolean flag
metadata
.get(LANCE_UNENFORCED_PRIMARY_KEY)
.filter(|s| matches!(s.to_lowercase().as_str(), "true" | "1" | "yes"))
.map(|_| 0)
});
let is_blob_v2 = has_blob_v2_extension(field);

if is_blob_v2 {
@@ -1154,7 +1176,7 @@ impl TryFrom<&ArrowField> for Field {
nullable: field.is_nullable(),
children,
dictionary: None,
unenforced_primary_key,
unenforced_primary_key_position,
})
}
}
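
The metadata parsing added to `TryFrom<&ArrowField>` in this file can be read as the following Python sketch: an explicit position wins when it parses as an unsigned 32-bit integer, otherwise the legacy boolean flag maps to `0`, otherwise the field is not part of the key. The function names are hypothetical, and `parse_u32` only approximates Rust's `str::parse::<u32>` (optional leading `+`, ASCII digits, range check).

```python
from typing import Mapping, Optional

PK_KEY = "lance-schema:unenforced-primary-key"
PK_POSITION_KEY = "lance-schema:unenforced-primary-key:position"
U32_MAX = 2**32 - 1


def parse_u32(text: str) -> Optional[int]:
    """Approximate Rust's `parse::<u32>()`; return None where it would fail."""
    digits = text[1:] if text.startswith("+") else text
    if not digits or any(ch not in "0123456789" for ch in digits):
        return None
    value = int(digits)
    return value if value <= U32_MAX else None


def parse_pk_position(metadata: Mapping[str, str]) -> Optional[int]:
    """Return the stored position: None, 0 (legacy flag), or an explicit value."""
    raw = metadata.get(PK_POSITION_KEY)
    position = parse_u32(raw) if raw is not None else None
    if position is not None:
        return position
    # Backward compatibility: the boolean flag alone yields position 0.
    flag = metadata.get(PK_KEY)
    if flag is not None and flag.lower() in ("true", "1", "yes"):
        return 0
    return None


explicit = parse_pk_position({PK_KEY: "true", PK_POSITION_KEY: "2"})
legacy = parse_pk_position({PK_KEY: "YES"})
fallback = parse_pk_position({PK_KEY: "true", PK_POSITION_KEY: "abc"})
absent = parse_pk_position({})
```

Note that, as written in the diff, a parsable position marks the field as a primary key member even when the boolean flag is missing.
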