feat: support unenforced primary key concept in schema #4002
jackye1995 merged 5 commits into lance-format:main
Conversation
Codecov Report

Attention: Patch coverage is

```
@@            Coverage Diff             @@
##             main    #4002      +/-   ##
==========================================
+ Coverage   78.60%   78.63%    +0.02%
==========================================
  Files         285      285
  Lines      113006   113198      +192
  Branches   113006   113198      +192
==========================================
+ Hits        88833    89014      +181
- Misses      20736    20746       +10
- Partials     3437     3438        +1
```

Flags with carried forward coverage won't be shown.
westonpace
left a comment
I'm somewhat skeptical of a PR that doesn't actually do anything just because things tend to change by the time it actually gets used. However, the logic seems ok.
```rust
if let Some(pk) = schema.primary_key() {
    let mut seen = HashSet::new();
    for pk_col in pk {
        if seen.contains(&pk_col) {
            return Err(Error::Schema {
                message: format!(
                    "Primary key cannot contain multiple copies of the same column: {}",
                    pk_col
                ),
                location: location!(),
            });
        }
        if let Some(field) = schema.field(&pk_col) {
            if field.nullable {
                return Err(Error::Schema {
                    message: format!("Primary key column must not be nullable: {}", pk_col),
                    location: location!(),
                });
            }
            seen.insert(pk_col);
        } else {
            return Err(Error::Schema {
                message: format!("Primary key column does not exist: {}", pk_col),
                location: location!(),
            });
        }
    }
}
```
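For illustration, the three rules the Rust snippet enforces (no duplicate columns, every column must exist, no nullable columns) can be sketched in Python; the function name and the `nullable_by_name` mapping are hypothetical, not lance's actual API:

```python
def validate_primary_key(pk_cols, nullable_by_name):
    """Sketch of the validation above.

    pk_cols: list of primary key column names.
    nullable_by_name: dict mapping column name -> nullable flag.
    """
    seen = set()
    for col in pk_cols:
        # Reject a key that names the same column twice.
        if col in seen:
            raise ValueError(
                f"Primary key cannot contain multiple copies of the same column: {col}"
            )
        # Reject columns that are not in the schema.
        if col not in nullable_by_name:
            raise ValueError(f"Primary key column does not exist: {col}")
        # Reject nullable columns.
        if nullable_by_name[col]:
            raise ValueError(f"Primary key column must not be nullable: {col}")
        seen.add(col)
```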
Right now we are just validating. Is the plan to put pk in the lance schema object at some point?
```rust
/// using key [`LANCE_SCHEMA_PRIMARY_KEY`] and specify the column name in value.
/// If this is a composite primary key, use `,` to delimit the column name.
/// If you need to override the delimiter, use key [`LANCE_SCHEMA_PRIMARY_KEY_DELIMITER`]
pub fn primary_key(&self) -> Option<Vec<String>> {
```
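For illustration, the delimiter convention this doc comment describes could look roughly like the following sketch (the `primary_key` / `primary_key_delim` metadata key names follow the PR description; the function is hypothetical, not lance's implementation):

```python
def parse_primary_key(metadata):
    """Parse primary key column names out of schema metadata,
    per the convention above. A sketch, not lance's actual code."""
    value = metadata.get("primary_key")
    if value is None:
        return None
    # Composite keys use ',' unless a custom delimiter is configured.
    delim = metadata.get("primary_key_delim", ",")
    return value.split(delim)
```

For example, `{"primary_key": "name,age"}` would yield `["name", "age"]`, and a custom delimiter handles column names that themselves contain commas.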
Should this return a vector of fields or field indices?
Okay, this question actually clarified a misunderstanding I had... When I originally looked at the code I was wondering why we directly set -1 for all field IDs, so I preferred to store primary keys by name. But after re-reading the code, it looks like we do reassign the field IDs through `schema.set_field_id(None)` after initializing it from the Arrow schema.
So in that case returning the fields definitely makes more sense, and I can convert the primary key information to be stored in the fields so it is resilient against field renames. Let me redo this integration.
Ah, this is something that is quite confusing and might be nice to clean up at some point.
A lance schema has field ids, an arrow schema does not. When we convert from an arrow schema to a lance schema we assign field ids in DFS order. However, these are not often the correct field ids. The correct field ids are those stored on the dataset (e.g. there might be a gap because a field was added and then removed).
To get a schema with the correct field ids you need to start with the dataset schema (and possibly project from there). So ds.primary_key() should work since it is operating on the dataset schema.
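The DFS assignment described above can be illustrated with a toy sketch (hypothetical code, not lance's actual converter):

```python
def assign_field_ids_dfs(fields, next_id=0):
    """Assign field ids in depth-first order.

    Each field is a (name, children) pair; returns a list of
    (name, id) pairs plus the next unused id. A toy illustration,
    not lance's conversion code.
    """
    assigned = []
    for name, children in fields:
        # Parent gets an id before its children (depth-first).
        assigned.append((name, next_id))
        next_id += 1
        child_assigned, next_id = assign_field_ids_dfs(children, next_id)
        assigned.extend(child_assigned)
    return assigned, next_id
```

Because fields may have been added or dropped over the dataset's history, the ids stored on the dataset schema can differ from this fresh DFS numbering, which is why the dataset schema is the source of truth.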
```rust
/// The primary key of the schema is set in the schema metadata
/// using key [`LANCE_SCHEMA_PRIMARY_KEY`] and specify the column name in value.
/// If this is a composite primary key, use `,` to delimit the column name.
/// If you need to override the delimiter, use key [`LANCE_SCHEMA_PRIMARY_KEY_DELIMITER`]
```
I have other questions like "what types of columns can be primary keys? Can the primary key be nullable? Must the primary key be unique? Are the primary key strings case sensitive?" Do you want to put those here or defer them for a future PR when we have something more user facing?
Currently the criteria is just not nullable. This is an unenforced primary key, I can make the name reflect that.
> Are the primary key strings case sensitive?
My understanding is that case sensitivity is a compute level setting rather than storage level. What is currently the behavior of Lance?
> Do you want to put those here or defer them for a future PR when we have something more user facing?
Regarding the user facing experience, my goal is that we have this PR out and then Flink side can directly leverage it. So if you think this is not sufficient then let's iterate here.
My original thinking is that the current change provides a UX that is sufficient, because they can set the primary key when they initially load into the dataset:

```python
schema = pa.schema([
    ("name", pa.string()),
    ("age", pa.int32()),
], metadata={"primary_key": "name"})
ds = lance.write_dataset(producer(),
    "./alice_and_bob.lance",
    schema=schema, mode="overwrite")
```

and then when they want to know what the primary key columns are, they just do:

```python
ds.primary_key()
```

Alternatively, we could also have an additional input for APIs like:

```python
ds = lance.write_dataset(producer(),
    "./alice_and_bob.lance",
    schema=schema, mode="overwrite",
    primary_key=["name"])
```

to externally supply the primary key outside the schema.
My feeling is that primary key is a part of the schema definition, supplying it as a part of the schema feels more natural to me, and since we are using Arrow schema as the main user interaction point, metadata is probably the easiest place to put the information. But I can also see a lot of config options for the write path, so adding it as another input option also seems okay.
Not sure if we have a preference here or any established patterns @westonpace
> My feeling is that primary key is a part of the schema definition, supplying it as a part of the schema feels more natural to me, and since we are using Arrow schema as the main user interaction point, metadata is probably the easiest place to put the information.
This makes sense. We use the schema to store things like this (e.g. field embeddings) in lancedb as well. I do think the schema is a good place for it. I had thought maybe your plan was "lance schema has primary key as a field (like field id)" and "arrow schema has primary key in metadata". But both of them using metadata works too.
Yeah, I was originally keeping everything in my own branch for testing things end to end. #3961 brought up that they need some notion of primary key to develop the Flink integration, which is why I separated this out first.
Is this just syntactic sugar for metadata then? Or do you envision
For the MemTable work I am also using it. Basically, when reading, the data from the memtable and the on-disk table is merged based on the primary key. That's why I have it in my branch.
@westonpace I did another version of the implementation and directly annotated the specific field as an unenforced primary key. It mostly follows the way you did
@westonpace I will merge first so I can rebase and publish the MemTable PR; we can discuss further over there with a concrete use case, and it clears the blocker for the Flink side for now.
Allow users to set the primary key through Arrow schema metadata with the `primary_key` config key. A primary key column must not be nullable. Users can configure a composite primary key with the `,` delimiter, or use a custom delimiter specified by another `primary_key_delim` config key.

Closes #4003