feat: support unenforced primary key concept in schema #4002
jackye1995 merged 5 commits into lance-format:main
Conversation
Codecov Report

Attention: Patch coverage is

```
@@            Coverage Diff             @@
##             main    #4002      +/-   ##
==========================================
+ Coverage   78.60%   78.63%    +0.02%
==========================================
  Files         285      285
  Lines      113006   113198      +192
  Branches   113006   113198      +192
==========================================
+ Hits        88833    89014      +181
- Misses      20736    20746       +10
- Partials     3437     3438        +1
```

Flags with carried forward coverage won't be shown.
westonpace
left a comment
I'm somewhat skeptical of a PR that doesn't actually do anything just because things tend to change by the time it actually gets used. However, the logic seems ok.
```rust
if let Some(pk) = schema.primary_key() {
    let mut seen = HashSet::new();
    for pk_col in pk {
        if seen.contains(&pk_col) {
            return Err(Error::Schema {
                message: format!(
                    "Primary key cannot contain multiple copies of the same column: {}",
                    pk_col
                ),
                location: location!(),
            });
        }
        if let Some(field) = schema.field(&pk_col) {
            if field.nullable {
                return Err(Error::Schema {
                    message: format!("Primary key column must not be nullable: {}", pk_col),
                    location: location!(),
                });
            }
            seen.insert(pk_col);
        } else {
            return Err(Error::Schema {
                message: format!("Primary key column does not exist: {}", pk_col),
                location: location!(),
            });
        }
    }
}
```
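For illustration, the three rules the Rust snippet enforces (no duplicate columns, every column must exist, no nullable columns) can be sketched in Python; the function name and the `nullable_by_name` mapping are hypothetical, not lance's actual API:

```python
def validate_primary_key(pk_cols, nullable_by_name):
    """Sketch of the validation above.

    pk_cols: list of primary key column names.
    nullable_by_name: dict mapping column name -> nullable flag.
    """
    seen = set()
    for col in pk_cols:
        # Reject a key that names the same column twice.
        if col in seen:
            raise ValueError(
                f"Primary key cannot contain multiple copies of the same column: {col}"
            )
        # Reject columns that are not in the schema.
        if col not in nullable_by_name:
            raise ValueError(f"Primary key column does not exist: {col}")
        # Reject nullable columns.
        if nullable_by_name[col]:
            raise ValueError(f"Primary key column must not be nullable: {col}")
        seen.add(col)
```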
Right now we are just validating. Is the plan to put pk in the lance schema object at some point?
```rust
/// using key [`LANCE_SCHEMA_PRIMARY_KEY`] and specify the column name in value.
/// If this is a composite primary key, use `,` to delimit the column name.
/// If you need to override the delimiter, use key [`LANCE_SCHEMA_PRIMARY_KEY_DELIMITER`]
pub fn primary_key(&self) -> Option<Vec<String>> {
```
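For illustration, the delimiter convention this doc comment describes could look roughly like the following sketch (the `primary_key` / `primary_key_delim` metadata key names follow the PR description; the function is hypothetical, not lance's implementation):

```python
def parse_primary_key(metadata):
    """Parse primary key column names out of schema metadata,
    per the convention above. A sketch, not lance's actual code."""
    value = metadata.get("primary_key")
    if value is None:
        return None
    # Composite keys use ',' unless a custom delimiter is configured.
    delim = metadata.get("primary_key_delim", ",")
    return value.split(delim)
```

For example, `{"primary_key": "name,age"}` would yield `["name", "age"]`, and a custom delimiter handles column names that themselves contain commas.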
Should this return a vector of fields or field indices?
Okay, this question actually clarified a misunderstanding I had... When I originally looked at the code I was wondering why we directly set -1 for all field IDs, so I preferred to store primary keys by name. But after re-reading the code, it looks like we do reassign the field IDs through `schema.set_field_id(None)` after initializing it from the Arrow schema.
So in that case returning the fields definitely makes more sense, and I can convert the primary key information to be stored in the fields so it is resilient against field renames. Let me redo this integration.
Ah, this is something that is quite confusing and might be nice to clean up at some point.
A lance schema has field ids, an arrow schema does not. When we convert from an arrow schema to a lance schema we assign field ids in DFS order. However, these are not often the correct field ids. The correct field ids are those stored on the dataset (e.g. there might be a gap because a field was added and then removed).
To get a schema with the correct field ids you need to start with the dataset schema (and possibly project from there). So ds.primary_key() should work since it is operating on the dataset schema.
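The DFS assignment described above can be illustrated with a toy sketch (hypothetical code, not lance's actual converter):

```python
def assign_field_ids_dfs(fields, next_id=0):
    """Assign field ids in depth-first order.

    Each field is a (name, children) pair; returns a list of
    (name, id) pairs plus the next unused id. A toy illustration,
    not lance's conversion code.
    """
    assigned = []
    for name, children in fields:
        # Parent gets an id before its children (depth-first).
        assigned.append((name, next_id))
        next_id += 1
        child_assigned, next_id = assign_field_ids_dfs(children, next_id)
        assigned.extend(child_assigned)
    return assigned, next_id
```

Because fields may have been added or dropped over the dataset's history, the ids stored on the dataset schema can differ from this fresh DFS numbering, which is why the dataset schema is the source of truth.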
```rust
/// The primary key of the schema is set in the schema metadata
/// using key [`LANCE_SCHEMA_PRIMARY_KEY`] and specify the column name in value.
/// If this is a composite primary key, use `,` to delimit the column name.
/// If you need to override the delimiter, use key [`LANCE_SCHEMA_PRIMARY_KEY_DELIMITER`]
```
I have other questions like "what types of columns can be primary keys? Can the primary key be nullable? Must the primary key be unique? Are the primary key strings case sensitive?" Do you want to put those here or defer them for a future PR when we have something more user facing?
Currently the criteria is just not nullable. This is an unenforced primary key, I can make the name reflect that.
> Are the primary key strings case sensitive?
My understanding is that case sensitivity is a compute level setting rather than storage level. What is currently the behavior of Lance?
> Do you want to put those here or defer them for a future PR when we have something more user facing?
Regarding the user facing experience, my goal is that we have this PR out and then Flink side can directly leverage it. So if you think this is not sufficient then let's iterate here.
My original thinking is that the current change provides a UX that is sufficient, because they can set the primary key when they initially load into the dataset:

```python
schema = pa.schema([
    ("name", pa.string()),
    ("age", pa.int32()),
], metadata={"primary_key": "name"})
ds = lance.write_dataset(producer(),
    "./alice_and_bob.lance",
    schema=schema, mode="overwrite")
```

and then when they want to know what the primary key columns are, they just do:

```python
ds.primary_key()
```

Alternatively, we could also have an additional input for APIs like:

```python
ds = lance.write_dataset(producer(),
    "./alice_and_bob.lance",
    schema=schema, mode="overwrite",
    primary_key=["name"])
```

to externally supply the primary key outside the schema.
My feeling is that primary key is a part of the schema definition, supplying it as a part of the schema feels more natural to me, and since we are using Arrow schema as the main user interaction point, metadata is probably the easiest place to put the information. But I can also see a lot of config options for the write path, so adding it as another input option also seems okay.
Not sure if we have a preference here or any established patterns @westonpace
> My feeling is that primary key is a part of the schema definition, supplying it as a part of the schema feels more natural to me, and since we are using Arrow schema as the main user interaction point, metadata is probably the easiest place to put the information.
This makes sense. We use the schema to store things like this (e.g. field embeddings) in lancedb as well. I do think the schema is a good place for it. I had thought maybe your plan was "lance schema has primary key as a field (like field id)" and "arrow schema has primary key in metadata". But both of them using metadata works too.
Yeah, I was originally keeping everything in my own branch for testing things end to end. #3961 brought up that they need some notion of primary key to develop the Flink integration, which is why I separated this out first.
Is this just syntactic sugar for metadata then? Or do you envision
For the MemTable work I am also using it. Basically, when reading, the data from the memtable and the on-disk table is merged based on the primary key. That's why I have it in my branch.
@westonpace I did another version of the implementation and directly annotated the specific field as an unenforced primary key. It mostly follows the way you did
@westonpace I will merge first so I can rebase and publish the MemTable PR; we can discuss further over there with a concrete use case, and it clears the blocker for the Flink side for now.
Allow users to set the primary key through Arrow schema metadata with the `primary_key` config key. A primary key column must not be nullable. Users can configure a composite primary key with the `,` delimiter, or use a custom delimiter specified by another `primary_key_delim` config key.

Closes #4003