feat!: support shallow_clone in dataset#4257
Conversation
Codecov Report ❌
Additional details and impacted files:
@@ Coverage Diff @@
## main #4257 +/- ##
==========================================
- Coverage 81.13% 81.11% -0.03%
==========================================
Files 308 308
Lines 113944 114296 +352
Branches 113944 114296 +352
==========================================
+ Hits 92448 92706 +258
- Misses 18238 18318 +80
- Partials 3258 3272 +14
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
This is actually different from what I was thinking for shallow clone. It looks like what we are doing here is adding a reference object so that the data and delete file paths are resolved using the reference, instead of using the table's root location.

I was thinking that instead of introducing a Clone operation and a Reference object in the manifest, when we do shallow clone, we convert all the relative paths to absolute paths, and we fix the reader to allow reading absolute paths instead of always resolving a relative path. I think that is also how Delta does shallow clone: https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/commands/CloneTableBase.scala#L147-L151

Overall, my feeling is that using absolute paths is better, because with a reference it is unclear how we will resolve file paths after we write new fragments: some need to be resolved using the reference dataset, some need to use the current new dataset, and things get complicated quickly. Absolute paths will also be useful for other cases like importing files, so it is in general a good feature to have. What do you think?
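To make the absolute-path idea concrete, here is a minimal sketch of a reader that accepts both forms; the function name and path handling are hypothetical, not lance's actual API:

```rust
// Hypothetical sketch: a reader that accepts both absolute and relative
// file paths, as suggested above. Names are illustrative, not lance's API.
fn resolve_file_path(table_root: &str, file_path: &str) -> String {
    // Treat URIs and absolute filesystem paths as already fully qualified.
    if file_path.contains("://") || file_path.starts_with('/') {
        file_path.to_string()
    } else {
        format!("{}/{}", table_root.trim_end_matches('/'), file_path)
    }
}

fn main() {
    // A relative path resolves against the cloned table's own root.
    assert_eq!(
        resolve_file_path("s3://bucket/clone", "data/f1.lance"),
        "s3://bucket/clone/data/f1.lance"
    );
    // An absolute path written at clone time still points at the source table.
    assert_eq!(
        resolve_file_path("s3://bucket/clone", "s3://bucket/source/data/f0.lance"),
        "s3://bucket/source/data/f0.lance"
    );
    println!("ok");
}
```

Under this scheme, the shallow clone rewrites the manifest once, and readers need no knowledge of the source table beyond the absolute paths themselves.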
cc @wjones127 if you have any thoughts on this |
Thanks for the Delta input. I will look into it. Let's split it into two parts:

Clone operation. I think for now every manifest/version has a related transaction. What do we put in the cloned dataset as the initial transaction, like Overwrite, if we don't have a Clone operation? It will help if you can elaborate more on your thoughts. (I think this may be unrelated to the reference thing?)

Absolute path. I did some work to use absolute paths in this case, and some thoughts came to mind:
This is my mindmap. I thought this reference was quite a Lance-unique thing, because of the fragment id and the deletion file construction.
For now I think the 'unclear' part is a simple condition for both the data file and the deletion file. But I may be wrong or misunderstand something. I'm not insistent on this reference mechanism, but I think it will help if we can talk through it more deeply and do some comparison.
I think I need to learn more from scenarios like the file import you mentioned; apparently I don't have the whole picture. Overall I'm willing to embrace using absolute paths in each data file, and I'm open to improving this prototype in any way.
Yeah, my original thought was that it would just reflect the latest operation done to the original table. But I think you are right, it is better to just describe it as a clone from the source table.
Yeah, I am only thinking of using absolute paths for the initial manifest content. If there are new fragments after new operations on the cloned table, then those will take relative paths, and that is not ambiguous because we always use the source location to derive the full path if it is relative.
Ah, good point. I did not know that the delete file path is constructed instead of stored as a relative path. It seems like if there is some importance in using the constructed path for the delete file, then I agree using a reference is a good idea; the design looks like a good start. I think it needs to be a list of references instead of just one, because there can be situations where I clone a table and then after a while I want to clone again, and then we will need to resolve file paths from multiple tables. And you might need to create some more actions like

But it feels like hard-coding the delete file path could be limiting: if we want to do features like multi-tier storage and move files around tiers, then it will become a blocker again. @wjones127 what were the considerations around this, if you could share the historical context?
I am talking about features like https://iceberg.apache.org/docs/latest/spark-procedures/#add_files. |
Yes, I think this is a good point. In my mind this was a nested reference like: I think a reference list would be better.
I thought a reference could be deleted automatically when the referred versions are cleaned up in the cloned dataset. Can you elaborate a little on the 'DropReference' scenario?
I chatted with Will and Weston lately, here is a summary:

Hard-coded paths

In general, for the long term, we should have a reference for all file paths instead of hard-coding file paths. Originally, paths like the delete file path were hard-coded to save manifest size. For now, regarding the shallow clone feature, we should aim for just copying the files with hard-coded paths to the cloned dataset. Over time, as we move more files to references, shallow clone will copy fewer files.

Relative paths for different bases

The easiest way to handle this is to just store absolute paths, but that could inflate the overall manifest size. The reference approach you introduced is one way to handle that; Weston suggested another, potentially simpler way: for each path, we store the tuple of (base, path), where base is not the base path itself, but just a symbol. At the manifest level, we store a base-to-URI mapping to resolve it.

At the protobuf definition level, this probably means an optional base in places like:

message DataFile {
  // Relative path to the base
  string path = 1;
  ...
  // optional base, if not the root table location
  optional string base = 7;
}

For cloning/branching, we can rewrite the manifest to add a base for the paths in the manifest.

Other optimizations

We should also add compression to the manifest to keep the manifest size in control. Currently protobuf does very little with string compression.

@majin1102 please let us know what you think, if you agree with these ideas or have any other suggestions!
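A minimal sketch of resolving paths under the (base, path) tuple scheme; the struct layout, `resolve` function, and the use of a numeric base id (rather than a string symbol) are all hypothetical stand-ins:

```rust
use std::collections::HashMap;

// Hypothetical sketch of the (base, path) tuple scheme described above.
// Each file stores an optional base id; the manifest maps ids to root URIs.
struct DataFile {
    path: String,      // relative path under the base
    base: Option<u32>, // None => resolve against the table's own root
}

fn resolve(bases: &HashMap<u32, String>, table_root: &str, f: &DataFile) -> String {
    let root = f
        .base
        .and_then(|id| bases.get(&id))
        .map(String::as_str)
        .unwrap_or(table_root);
    format!("{}/{}", root.trim_end_matches('/'), f.path)
}

fn main() {
    let mut bases = HashMap::new();
    // Base 1 points at the source table this dataset was cloned from.
    bases.insert(1, "s3://bucket/source".to_string());

    let inherited = DataFile { path: "data/f0.lance".into(), base: Some(1) };
    let fresh = DataFile { path: "data/f1.lance".into(), base: None };

    assert_eq!(
        resolve(&bases, "s3://bucket/clone", &inherited),
        "s3://bucket/source/data/f0.lance"
    );
    assert_eq!(
        resolve(&bases, "s3://bucket/clone", &fresh),
        "s3://bucket/clone/data/f1.lance"
    );
    println!("ok");
}
```

The key property is that the manifest stores each long root URI once, so repeated clones do not inflate every per-file path the way raw absolute paths would.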
This approach is highly impressive to me; it seems to cover everything.
I think some details could be discussed:
Force-pushed from 14b76b8 to 3e56ea2 (compare)
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
This is ready for another review. @jackye1995 |
jackye1995
left a comment
I left some minor nit comments, but apart from that this looks good to me!
long readVersion,
Long numDeletedRows,
DeletionFileType fileType,
Integer pathBaseIndex) {
nit: variable name not updated, should be baseId
// - true: Performs a metadata-only clone (copies manifest without data files).
//   The cloned dataset references original data through `base_paths`,
//   suitable for experimental scenarios or rapid metadata migration.
// - false: Performs a full deep clone using the underlying object storage's native
if we actually do deep clone like what is described here, can we still commit the cloned dataset with this operation? It seems like we cannot, because we are just bulk-copying the directory to somewhere else?
In my mind, we should copy each data file at the target version instead of bulk-copying the directory, which could benefit us:
- Compared to copying the whole directory, we only focus on the files at the target version, which makes the cloning operation somewhat lightweight.
- Compared to read-and-write, we don't need to load Arrow data, so this should be somewhat faster.
- Moreover, I think we may provide some properties to configure whether we need multiple versions or include indices, which will make this function more flexible.
What do you think?
Today many users actually do directly copy a Lance dataset to cloud. This is especially popular for ML scientists that do the initial dataset locally then copy it and hand over to a team to bulk load the rest. I think what you say makes sense. Maybe if the storage supports it we can also copy the entire table directory except the version directory and then commit the deep cloned manifest.
Regarding configs to include index, technically that could be an option but I don't know why we would desire no index if we support it.
For multiple versions, I am not so sure what the use case would be. Especially since we consider the commit history to be linear, it's hard to comprehend what it means to have multiple Clone transactions on top of each other.
But overall I think we don't need to hang on too long here, we can discuss these more in the Discussion thread.
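As a sketch of the per-file deep clone idea discussed above (copy only what the target version references, instead of the whole directory), with simplified stand-ins for lance's fragment metadata:

```rust
// Hypothetical sketch: collect only the files referenced by the target
// version's fragments, so a deep clone copies these instead of the
// entire table directory. Types are simplified stand-ins.
struct Fragment {
    files: Vec<String>,            // data file paths
    deletion_file: Option<String>, // optional deletion file path
}

fn files_to_copy(fragments: &[Fragment]) -> Vec<String> {
    let mut out = Vec::new();
    for frag in fragments {
        out.extend(frag.files.iter().cloned());
        if let Some(d) = &frag.deletion_file {
            out.push(d.clone());
        }
    }
    out
}

fn main() {
    let frags = vec![
        Fragment { files: vec!["data/a.lance".into()], deletion_file: None },
        Fragment {
            files: vec!["data/b.lance".into()],
            deletion_file: Some("_deletions/b.arrow".into()),
        },
    ];
    let to_copy = files_to_copy(&frags);
    assert_eq!(to_copy, vec!["data/a.lance", "data/b.lance", "_deletions/b.arrow"]);
    println!("ok");
}
```

Files from older versions that the target version no longer references are simply never visited, which is what keeps this cheaper than a directory copy.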
}
impl DataFile {
    pub fn refer(datafile: &Self, base_id: u32) -> Self {
nit: it is not clear to me that refer actually means "refer to a base ID", can we use a more explicit function name like with_base_id?
Actually this is not used anymore. Now we clone the datafiles and directly set the value:
let cloned_fragments = self
.fragments
.as_ref()
.iter()
.map(|fragment| {
let mut cloned_fragment = fragment.clone();
cloned_fragment.files = cloned_fragment
.files
.into_iter()
.map(|mut file| {
file.base_id = Some(new_base_id);
file
})
.collect();
if let Some(mut deletion) = cloned_fragment.deletion_file.take() {
deletion.base_id = Some(new_base_id);
cloned_fragment.deletion_file = Some(deletion);
}
cloned_fragment
})
I will delete it
// Flag indicating whether this path is a dataset root path or file directory:
// - true: Path is a dataset root (actual files under subdirectories like `data`, `_deletions`)
// - false: Path is a direct file directory (scenario like importing files)
bool dataset_base = 3;
this name feels a bit confusing, because there is a "base path", and then this "dataset base" is a boolean, and also we don't really have a concept of "dataset base" so far, we have been calling it a "database root". Can we name it something like "is_dataset_root"?
I must say "is_dataset_root" flashed through my mind. I don't recall why I didn't choose it.
)
})?;

Ok(Path::parse(base_path.path.as_str())?)
we should add a check that the BasePath must be a dataset root.
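A minimal sketch of that validation, with hypothetical stand-in types (the real BasePath lives in the manifest protobuf, and the flag name is still under discussion above):

```rust
// Hypothetical sketch of the suggested check: reject a BasePath that is
// not a dataset root before resolving file paths against it.
struct BasePath {
    path: String,
    is_dataset_root: bool,
}

fn dataset_root(base: &BasePath) -> Result<&str, String> {
    if base.is_dataset_root {
        Ok(&base.path)
    } else {
        Err(format!("base path '{}' is not a dataset root", base.path))
    }
}

fn main() {
    let good = BasePath { path: "s3://bucket/source".into(), is_dataset_root: true };
    let bad = BasePath { path: "s3://bucket/raw_files".into(), is_dataset_root: false };

    assert_eq!(dataset_root(&good).unwrap(), "s3://bucket/source");
    assert!(dataset_root(&bad).is_err());
    println!("ok");
}
```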
ref_name: &str,
store_params: ObjectStoreParams,
) -> Result<Self> {
// self.tags.create(ref_name, self.version().version).await?;
remove this line? Or I think we can have an optional create_tag flag which runs this if true.
Yeah, I think this is a to-be-discussed item left behind.
Do you think we should bind shallow_clone to tags? Or we could directly shallow_clone from a u32 version (making the parameter an impl Into<refs::Ref> like checkout_version).
Of course the lifetime of the files is not guaranteed. I think this is an authorization question: normally the shallow-cloning user should be a reader on the source dataset, and shallow_clone itself is like a read operation on the source; binding it to tags means this operation must depend on a tag write/manage operation, which is not that reasonable. Consider the case where users only want to run a very short experiment on the latest version; I think it could be independent of tags.
Also, IMO auto-creating a tag in shallow_clone is not very reasonable according to the least-privilege principle.
What do you think about this? This would make ref_name an Option<String>, which quite makes sense to me. @jackye1995
Sounds good to me, agree separating it makes more sense
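The impl Into<refs::Ref> pattern discussed above could look roughly like this; the Ref enum and describe function are illustrative, and the real checkout_version/Ref types in lance may differ:

```rust
// Hypothetical sketch of accepting either a version number or a tag name,
// mirroring the checkout_version-style `impl Into<Ref>` suggested above.
#[derive(Debug, PartialEq)]
enum Ref {
    Version(u64),
    Tag(String),
}

impl From<u64> for Ref {
    fn from(v: u64) -> Self {
        Ref::Version(v)
    }
}

impl From<&str> for Ref {
    fn from(s: &str) -> Self {
        Ref::Tag(s.to_string())
    }
}

// A shallow_clone-style entry point could then take either form.
fn describe(r: impl Into<Ref>) -> String {
    match r.into() {
        Ref::Version(v) => format!("version {}", v),
        Ref::Tag(t) => format!("tag {}", t),
    }
}

fn main() {
    assert_eq!(describe(42u64), "version 42");
    assert_eq!(describe("stable"), "tag stable");
    println!("ok");
}
```

With this shape, tag creation stays a separate, explicitly privileged operation, while cloning from a bare version number needs only read access.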
Ok(())
}

pub async fn shallow_clone(
we should add documentation given this is a public API
Oh one more thing I forgot to mention, I think for this feature we need to update the |
Newest update: addressed comments, with two discussions left behind:
Then let me raise another PR to do this (immediately after this PR). Thank you
Please take another look when you have time |
Thanks for all the work, looking forward to the next steps!! |
Context: #4257 (comment) --------- Co-authored-by: Jack Ye <yezhaoqin@gmail.com>

This is the prototype of shallow clone.
Design document: #4256
I did tests on some real datasets; basic reads and writes work. We may need some deeper test cases.
If there are confusing concepts or details, please let me know or refer to the document.
@jackye1995 cc @wojiaodoubao