feat!: support shallow_clone in dataset#4257
Conversation
Codecov Report ❌
Additional details and impacted files:
@@ Coverage Diff @@
## main #4257 +/- ##
==========================================
- Coverage 81.13% 81.11% -0.03%
==========================================
Files 308 308
Lines 113944 114296 +352
Branches 113944 114296 +352
==========================================
+ Hits 92448 92706 +258
- Misses 18238 18318 +80
- Partials 3258 3272 +14
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
This is actually different from what I was thinking for shallow clone. It looks like what we are doing here is adding a reference object so that the data and delete file paths are resolved using the reference, instead of using the table's root location.

I was thinking that instead of introducing a Clone operation and a Reference object in the manifest, when we do shallow clone, we convert all the relative paths to absolute paths, and we fix the reader to allow reading absolute paths instead of always resolving a relative path. I think that is also how Delta does shallow clone: https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/commands/CloneTableBase.scala#L147-L151

Overall, my feeling is that using absolute paths is better, because with a reference it is unclear how we will resolve file paths after we write new fragments: some need to be resolved using the reference dataset, some need to use the current new dataset, and things get complicated quickly. Absolute paths will also be useful for other cases like importing files, so it is in general a good feature to have. What do you think?
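To make the absolute-path idea concrete, here is a minimal sketch of a reader that accepts both forms; the function name and path handling are hypothetical, not lance's actual API:

```rust
// Hypothetical sketch: a reader that accepts both absolute and relative
// file paths, as suggested above. Names are illustrative, not lance's API.
fn resolve_file_path(table_root: &str, file_path: &str) -> String {
    // Treat URIs and absolute filesystem paths as already fully qualified.
    if file_path.contains("://") || file_path.starts_with('/') {
        file_path.to_string()
    } else {
        format!("{}/{}", table_root.trim_end_matches('/'), file_path)
    }
}

fn main() {
    // A relative path resolves against the cloned table's own root.
    assert_eq!(
        resolve_file_path("s3://bucket/clone", "data/f1.lance"),
        "s3://bucket/clone/data/f1.lance"
    );
    // An absolute path written at clone time still points at the source table.
    assert_eq!(
        resolve_file_path("s3://bucket/clone", "s3://bucket/source/data/f0.lance"),
        "s3://bucket/source/data/f0.lance"
    );
    println!("ok");
}
```

Under this scheme, the shallow clone rewrites the manifest once, and readers need no knowledge of the source table beyond the absolute paths themselves.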
cc @wjones127 if you have any thoughts on this |
Thanks for the Delta input. I will look into it. Let's split it into two parts:

Clone operation. I think for now every manifest/version has a related transaction. What do we put in the cloned dataset as the initial transaction, like Overwrite, if we don't have a Clone operation? It will help if you can elaborate more on your thoughts. (I think this may be unrelated to the reference thing?)

Absolute path. I did some work to use absolute paths in this case, and some thoughts came to mind:
This is my mindmap. I thought this reference was quite a Lance-unique thing, because of the fragment id and the deletion file construction.
For now I think the 'unclear' part is a simple condition for both the data file and the deletion file. But I may be wrong or misunderstand something. I'm not insistent on this reference mechanism, but I think it will help if we can talk through it more deeply and do some comparison.
I think I need to learn more from scenarios like the file import you mentioned; apparently I don't have the whole picture. Overall I'm willing to embrace using absolute paths in each data file, and I'm open to improving this prototype in any way.
Yeah, my original thought was that it would just reflect the latest operation done to the original table. But I think you are right, it is better to just describe it as a clone from the source table.
Yeah, I am only thinking of using absolute paths for the initial manifest content. If there are new fragments after new operations on the cloned table, then those will take relative paths, and that is not ambiguous because we always use the source location to derive the full path if it is relative.
Ah, good point. I did not know that the delete file path is constructed instead of stored as a relative path. It seems like if there is some importance in using the constructed path for the delete file, then I agree using a reference is a good idea; the design looks like a good start. I think it needs to be a list of references instead of just one, because there can be situations where I clone a table and then after a while I want to clone again, and then we will need to resolve file paths from multiple tables. And you might need to create some more actions like

But it feels like hard-coding the delete file path could be limiting: if we want to do features like multi-tier storage and move files around tiers, then it will become a blocker again. @wjones127 what were the considerations around this, if you could share the historical context?
I am talking about features like https://iceberg.apache.org/docs/latest/spark-procedures/#add_files. |
Yes, I think this is a good point. In my mind this was a nested reference like: I think a reference list would be better.
I thought a reference could be deleted automatically when the referred versions are cleaned up in the cloned dataset. Can you elaborate a little on the 'DropReference' scenario?
I chatted with Will and Weston lately, here is a summary:

Hard-coded paths

In general, for the long term, we should have a reference for all file paths instead of hard-coding file paths. Originally, paths like the delete file path were hard-coded to save manifest size. For now, regarding the shallow clone feature, we should aim for just copying the files with hard-coded paths to the cloned dataset. Over time, as we move more files to references, shallow clone will copy fewer files.

Relative paths for different bases

The easiest way to handle this is to just store absolute paths, but that could inflate the overall manifest size. The reference approach you introduced is one way to handle that; Weston suggested another, potentially simpler way: for each path, we store the tuple of (base, path), where base is not the base path itself, but just a symbol. At the manifest level, we store a base-to-URI mapping to resolve it.

At the protobuf definition level, this probably means an optional base in places like:

message DataFile {
  // Relative path to the base
  string path = 1;
  ...
  // optional base, if not the root table location
  optional string base = 7;
}

For cloning/branching, we can rewrite the manifest to add a base for the paths in the manifest.

Other optimizations

We should also add compression to the manifest to keep the manifest size in control. Currently protobuf does very little with string compression.

@majin1102 please let us know what you think, if you agree with these ideas or have any other suggestions!
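A minimal sketch of resolving paths under the (base, path) tuple scheme; the struct layout, `resolve` function, and the use of a numeric base id (rather than a string symbol) are all hypothetical stand-ins:

```rust
use std::collections::HashMap;

// Hypothetical sketch of the (base, path) tuple scheme described above.
// Each file stores an optional base id; the manifest maps ids to root URIs.
struct DataFile {
    path: String,      // relative path under the base
    base: Option<u32>, // None => resolve against the table's own root
}

fn resolve(bases: &HashMap<u32, String>, table_root: &str, f: &DataFile) -> String {
    let root = f
        .base
        .and_then(|id| bases.get(&id))
        .map(String::as_str)
        .unwrap_or(table_root);
    format!("{}/{}", root.trim_end_matches('/'), f.path)
}

fn main() {
    let mut bases = HashMap::new();
    // Base 1 points at the source table this dataset was cloned from.
    bases.insert(1, "s3://bucket/source".to_string());

    let inherited = DataFile { path: "data/f0.lance".into(), base: Some(1) };
    let fresh = DataFile { path: "data/f1.lance".into(), base: None };

    assert_eq!(
        resolve(&bases, "s3://bucket/clone", &inherited),
        "s3://bucket/source/data/f0.lance"
    );
    assert_eq!(
        resolve(&bases, "s3://bucket/clone", &fresh),
        "s3://bucket/clone/data/f1.lance"
    );
    println!("ok");
}
```

The key property is that the manifest stores each long root URI once, so repeated clones do not inflate every per-file path the way raw absolute paths would.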
This approach is highly impressive to me; it seems to cover everything.
I think some details could be discussed:
Force-pushed from 14b76b8 to 3e56ea2 (compare)
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
This is ready for another review. @jackye1995 |
jackye1995
left a comment
I left some minor nit comments, but apart from that this looks good to me!
long readVersion,
Long numDeletedRows,
DeletionFileType fileType,
Integer pathBaseIndex) {
nit: variable name not updated, should be baseId
// - true: Performs a metadata-only clone (copies manifest without data files).
//   The cloned dataset references original data through `base_paths`,
//   suitable for experimental scenarios or rapid metadata migration.
// - false: Performs a full deep clone using the underlying object storage's native
if we actually do deep clone like what is described here, can we still commit the cloned dataset with this operation? It seems like we cannot, because we are just bulk-copying the directory to somewhere else?
In my mind, we should copy each data file at the target version instead of bulk-copying the directory, which could benefit us:
- Compared to copying the whole directory, we only focus on the files at the target version, which makes the cloning operation somewhat lightweight.
- Compared to read-and-write, we don't need to load Arrow data, so this should be somewhat faster.
- Moreover, I think we may provide some properties to configure whether we need multiple versions or include indices, which will make this function more flexible.
What do you think?
Today many users actually do directly copy a Lance dataset to cloud. This is especially popular for ML scientists that do the initial dataset locally then copy it and hand over to a team to bulk load the rest. I think what you say makes sense. Maybe if the storage supports it we can also copy the entire table directory except the version directory and then commit the deep cloned manifest.
Regarding configs to include index, technically that could be an option but I don't know why we would desire no index if we support it.
For multiple versions, I am not so sure what the use case would be. Especially since we consider the commit history to be linear, it's hard to comprehend what it means to have multiple Clone transactions on top of each other.
But overall I think we don't need to hang on too long here, we can discuss these more in the Discussion thread.
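As a sketch of the per-file deep clone idea discussed above (copy only what the target version references, instead of the whole directory), with simplified stand-ins for lance's fragment metadata:

```rust
// Hypothetical sketch: collect only the files referenced by the target
// version's fragments, so a deep clone copies these instead of the
// entire table directory. Types are simplified stand-ins.
struct Fragment {
    files: Vec<String>,            // data file paths
    deletion_file: Option<String>, // optional deletion file path
}

fn files_to_copy(fragments: &[Fragment]) -> Vec<String> {
    let mut out = Vec::new();
    for frag in fragments {
        out.extend(frag.files.iter().cloned());
        if let Some(d) = &frag.deletion_file {
            out.push(d.clone());
        }
    }
    out
}

fn main() {
    let frags = vec![
        Fragment { files: vec!["data/a.lance".into()], deletion_file: None },
        Fragment {
            files: vec!["data/b.lance".into()],
            deletion_file: Some("_deletions/b.arrow".into()),
        },
    ];
    let to_copy = files_to_copy(&frags);
    assert_eq!(to_copy, vec!["data/a.lance", "data/b.lance", "_deletions/b.arrow"]);
    println!("ok");
}
```

Files from older versions that the target version no longer references are simply never visited, which is what keeps this cheaper than a directory copy.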
}
impl DataFile {
    pub fn refer(datafile: &Self, base_id: u32) -> Self {
nit: it is not clear to me that refer actually means "refer to a base ID", can we use a more explicit function name like with_base_id?
Actually this is not used anymore. Now we clone the datafiles and directly set the value:
let cloned_fragments = self
.fragments
.as_ref()
.iter()
.map(|fragment| {
let mut cloned_fragment = fragment.clone();
cloned_fragment.files = cloned_fragment
.files
.into_iter()
.map(|mut file| {
file.base_id = Some(new_base_id);
file
})
.collect();
if let Some(mut deletion) = cloned_fragment.deletion_file.take() {
deletion.base_id = Some(new_base_id);
cloned_fragment.deletion_file = Some(deletion);
}
cloned_fragment
})
I will delete it
// Flag indicating whether this path is a dataset root path or file directory:
// - true: Path is a dataset root (actual files under subdirectories like `data`, `_deletions`)
// - false: Path is a direct file directory (scenario like importing files)
bool dataset_base = 3;
this name feels a bit confusing, because there is a "base path", and then this "dataset base" is a boolean, and also we don't really have a concept of "dataset base" so far, we have been calling it a "database root". Can we name it something like "is_dataset_root"?
I must say "is_dataset_root" flashed through my mind. I don't recall why I didn't choose it.
)
})?;

Ok(Path::parse(base_path.path.as_str())?)
we should add a check that the BasePath must be a dataset root.
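A minimal sketch of that validation, with hypothetical stand-in types (the real BasePath lives in the manifest protobuf, and the flag name is still under discussion above):

```rust
// Hypothetical sketch of the suggested check: reject a BasePath that is
// not a dataset root before resolving file paths against it.
struct BasePath {
    path: String,
    is_dataset_root: bool,
}

fn dataset_root(base: &BasePath) -> Result<&str, String> {
    if base.is_dataset_root {
        Ok(&base.path)
    } else {
        Err(format!("base path '{}' is not a dataset root", base.path))
    }
}

fn main() {
    let good = BasePath { path: "s3://bucket/source".into(), is_dataset_root: true };
    let bad = BasePath { path: "s3://bucket/raw_files".into(), is_dataset_root: false };

    assert_eq!(dataset_root(&good).unwrap(), "s3://bucket/source");
    assert!(dataset_root(&bad).is_err());
    println!("ok");
}
```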
ref_name: &str,
store_params: ObjectStoreParams,
) -> Result<Self> {
// self.tags.create(ref_name, self.version().version).await?;
remove this line? Or I think we can have an optional create_tag flag which runs this if true.
Yeah, I think this is a to-be-discussed item left behind.
Do you think we should bind shallow_clone to tags? Or we could directly shallow_clone from a u32 version (making the parameter an impl Into<refs::Ref> like checkout_version).
Of course the lifetime of the files is not guaranteed. I think this is an authorization question: normally the shallow-cloning user should be a reader on the source dataset, and shallow_clone itself is like a read operation on the source; binding it to tags means this operation must depend on a tag write/manage operation, which is not that reasonable. Consider the case where users only want to run a very short experiment on the latest version; I think it could be independent of tags.
Also, IMO auto-creating a tag in shallow_clone is not very reasonable according to the least-privilege principle.
What do you think about this? This would make ref_name an Option<String>, which quite makes sense to me. @jackye1995
Sounds good to me, agree separating it makes more sense
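The impl Into<refs::Ref> pattern discussed above could look roughly like this; the Ref enum and describe function are illustrative, and the real checkout_version/Ref types in lance may differ:

```rust
// Hypothetical sketch of accepting either a version number or a tag name,
// mirroring the checkout_version-style `impl Into<Ref>` suggested above.
#[derive(Debug, PartialEq)]
enum Ref {
    Version(u64),
    Tag(String),
}

impl From<u64> for Ref {
    fn from(v: u64) -> Self {
        Ref::Version(v)
    }
}

impl From<&str> for Ref {
    fn from(s: &str) -> Self {
        Ref::Tag(s.to_string())
    }
}

// A shallow_clone-style entry point could then take either form.
fn describe(r: impl Into<Ref>) -> String {
    match r.into() {
        Ref::Version(v) => format!("version {}", v),
        Ref::Tag(t) => format!("tag {}", t),
    }
}

fn main() {
    assert_eq!(describe(42u64), "version 42");
    assert_eq!(describe("stable"), "tag stable");
    println!("ok");
}
```

With this shape, tag creation stays a separate, explicitly privileged operation, while cloning from a bare version number needs only read access.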
Ok(())
}

pub async fn shallow_clone(
we should add documentation given this is a public API
Oh one more thing I forgot to mention, I think for this feature we need to update the |
Newest update: addressed comments, with two discussions left behind:
Then let me raise another PR to do this (immediately after this PR). Thank you
Please take another look when you have time |
Thanks for all the work, looking forward to the next steps!! |
Context: #4257 (comment) --------- Co-authored-by: Jack Ye <yezhaoqin@gmail.com>

This is the prototype of shallow clone.
Design document: #4256
I did tests on some real datasets; basic reads and writes work. We may need some deeper test cases.
If there are confusing concepts or details, please let me know or refer to the document.
@jackye1995 cc @wojiaodoubao