Skip to content

feat: support unenforced primary key concept in schema#4002

Merged
jackye1995 merged 5 commits intolance-format:mainfrom
jackye1995:pk
Jun 18, 2025
Merged

feat: support unenforced primary key concept in schema#4002
jackye1995 merged 5 commits intolance-format:mainfrom
jackye1995:pk

Conversation

@jackye1995
Copy link
Copy Markdown
Contributor

@jackye1995 jackye1995 commented Jun 13, 2025

Allow users to set primary key through Arrow schema metadata with primary_key config key. A primary key column must not be nullable. User can configure composite primary key through , delimiter, or using a custom delimiter specified by another primary_key_delim config key.

Closes #4003

@github-actions github-actions Bot added the enhancement New feature or request label Jun 13, 2025
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Jun 13, 2025

Codecov Report

Attention: Patch coverage is 94.08602% with 11 lines in your changes missing coverage. Please review.

Project coverage is 78.63%. Comparing base (0c44f5a) to head (92938f4).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
rust/lance-core/src/datatypes/field.rs 47.36% 9 Missing and 1 partial ⚠️
rust/lance-core/src/datatypes/schema.rs 99.39% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4002      +/-   ##
==========================================
+ Coverage   78.60%   78.63%   +0.02%     
==========================================
  Files         285      285              
  Lines      113006   113198     +192     
  Branches   113006   113198     +192     
==========================================
+ Hits        88833    89014     +181     
- Misses      20736    20746      +10     
- Partials     3437     3438       +1     
Flag Coverage Δ
unittests 78.63% <94.08%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm somewhat skeptical of a PR that doesn't actually do anything just because things tend to change by the time it actually gets used. However, the logic seems ok.

Comment thread rust/lance-core/src/datatypes/schema.rs Outdated
Comment on lines +621 to +630
if let Some(pk) = schema.primary_key() {
let mut seen = HashSet::new();
for pk_col in pk {
if seen.contains(&pk_col) {
return Err(Error::Schema {
message: format!(
"Primary key cannot contain multiple copies of the same column: {}",
pk_col
),
location: location!(),
});
}
if let Some(field) = schema.field(&pk_col) {
if field.nullable {
return Err(Error::Schema {
message: format!("Primary key column must not be nullable: {}", pk_col),
location: location!(),
});
}
seen.insert(pk_col);
} else {
return Err(Error::Schema {
message: format!("Primary key column does not exist: {}", pk_col),
location: location!(),
});
}
}
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now we are just validating. Is the plan to put pk in the lance schema object at some point?

Comment thread rust/lance-core/src/datatypes/schema.rs Outdated
/// using key [`LANCE_SCHEMA_PRIMARY_KEY`] and specify the column name in value.
/// If this is a composite primary key, use `,` to delimit the column name.
/// If you need to override the delimiter, use key [`LANCE_SCHEMA_PRIMARY_KEY_DELIMITER`]
pub fn primary_key(&self) -> Option<Vec<String>> {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this return a vector of fields or field indices?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay this question actually clarified a misunderstanding I had... When I originally looked at the code I was wondering why we directly set -1 for all field IDs, so I preserved to store primary keys by name. But after re-reading the code looks like we do reassign the field IDs through schema.set_field_id(None) after initializing it from Arrow schema.

So in that case returning the fields definitely make more sense, and I can convert the primary key information to be stored in the fields so it is resilient against field rename. Let me redo this integration.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this is something that is quite confusing and might be nice to clean up at some point.

A lance schema has field ids, an arrow schema does not. When we convert from an arrow schema to a lance schema we assign field ids in DFS order. However, these are not often the correct field ids. The correct field ids are those stored on the dataset (e.g. there might be a gap because a field was added and renamed).

To get a schema with the correct field ids you need to start with the dataset schema (and possibly project from there). So ds.primary_key() should work since it is operating on the dataset schema.

Comment thread rust/lance-core/src/datatypes/schema.rs Outdated
Comment on lines +67 to +70
/// The primary key of the schema is set in the schema metadata
/// using key [`LANCE_SCHEMA_PRIMARY_KEY`] and specify the column name in value.
/// If this is a composite primary key, use `,` to delimit the column name.
/// If you need to override the delimiter, use key [`LANCE_SCHEMA_PRIMARY_KEY_DELIMITER`]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have other questions like "what types of columns can be primary keys? Can the primary key be nullable? Must the primary key be unique? Are the primary key strings case sensitive?" Do you want to put those here or defer them for a future PR when we have something more user facing?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the criteria is just not nullable. This is an unenforced primary key, I can make the name reflect that.

Are the primary key strings case sensitive

My understanding is that case sensitivity is a compute level setting rather than storage level. What is currently the behavior of Lance?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to put those here or defer them for a future PR when we have something more user facing?

Regarding the user facing experience, my goal is that we have this PR out and then Flink side can directly leverage it. So if you think this is not sufficient then let's iterate here.

My original thinking is that the current change provides a UX that is sufficient because they can set the primary key when they initially load into the dataset:

schema = pa.schema([
  ("name", pa.string()),
  ("age", pa.int32()),
], metadata={"primary_key": "name"})

ds = lance.write_dataset(producer(),
  "./alice_and_bob.lance",
  schema=schema, mode="overwrite")

and then when they want to know what are the primary key columns, they just do

ds.primary_key()

Alternatively, we could also have an additional input for APIs like

ds = lance.write_dataset(producer(),
  "./alice_and_bob.lance",
  schema=schema, mode="overwrite",
  primary_key=["name"])

to externally supply the primary key outside the schema.

My feeling is that primary key is a part of the schema definition, supplying it as a part of the schema feels more natural to me, and since we are using Arrow schema as the main user interaction point, metadata is probably the easiest place to put the information. But I can also see a lot of config options for the write path, so adding it as another input option also seems okay.

Not sure if we have a preference here or any established patterns @westonpace

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My feeling is that primary key is a part of the schema definition, supplying it as a part of the schema feels more natural to me, and since we are using Arrow schema as the main user interaction point, metadata is probably the easiest place to put the information.

This makes sense. We use the schema to store things like this (e.g. field embeddings) in lancedb as well. I do think the schema is a good place for it. I had thought maybe your plan was "lance schema has primary key as a field (like field id)" and "arrow schema has primary key in metadata". But both of them using metadata works too.

@jackye1995
Copy link
Copy Markdown
Contributor Author

I'm somewhat skeptical of a PR that doesn't actually do anything just because things tend to change by the time it actually gets used. However, the logic seems ok.

Yeah I was originally having everything in my own branch for testing things end to end. #3961 brought up that they need some notion of primary key to develop the Flink integration that's why I separated this out first

@westonpace
Copy link
Copy Markdown
Member

Yeah I was originally having everything in my own branch for testing things end to end. #3961 brought up that they need some notion of primary key to develop the Flink integration that's why I separated this out first

Is this just syntactic sugar for metadata then? Or do you envision lance might use this at some point too?

@jackye1995
Copy link
Copy Markdown
Contributor Author

Is this just syntactic sugar for metadata then? Or do you envision lance might use this at some point too?

For the MemTable work I am also using it. Basically when reading, the data from memtable and on-disk table is merged based on the primary key. That's why I have it in my branch.

@jackye1995
Copy link
Copy Markdown
Contributor Author

@westonpace I did another version of implementation and directly annotate dthe specific field as an enforced primary key. It mostly follows the way you did lance-schema:storage-class, I feel it is a better pattern compared to using the schema metadata. Let me know what you think!

@jackye1995 jackye1995 changed the title feat: support primary key concept in schema feat: support unenforced primary key concept in schema Jun 17, 2025
@jackye1995
Copy link
Copy Markdown
Contributor Author

@westonpace I will merge first so I can rebase and publish the MemTable PR, and we can discuss further over there as a concrete use case and it clears the blocker for the Flink side for now.

@jackye1995 jackye1995 merged commit 84e33eb into lance-format:main Jun 18, 2025
25 of 27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[MemTable] primary key

3 participants