
feat!: support shallow_clone in dataset #4257

Merged
jackye1995 merged 76 commits into lance-format:main from majin1102:shallow-clone
Aug 23, 2025
Conversation

@majin1102
Contributor

@majin1102 majin1102 commented Jul 18, 2025

This is the prototype of shallow clone.

Design document: #4256

I did tests on some real datasets; basic reads and writes work. We may need some deeper test cases.

If there are confusing concepts or details, please let me know or refer to the document.

@jackye1995 cc @wojiaodoubao

@github-actions github-actions Bot added the enhancement New feature or request label Jul 18, 2025
@codecov-commenter

codecov-commenter commented Jul 18, 2025

Codecov Report

❌ Patch coverage is 71.61458% with 109 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.11%. Comparing base (74eefec) to head (ba92e3e).

Files with missing lines Patch % Lines
rust/lance/src/dataset/transaction.rs 25.00% 54 Missing ⚠️
rust/lance/src/dataset.rs 80.00% 25 Missing and 4 partials ⚠️
rust/lance/src/dataset/fragment.rs 33.33% 9 Missing and 5 partials ⚠️
rust/lance-table/src/format/manifest.rs 94.44% 2 Missing and 3 partials ⚠️
rust/lance/src/io/commit.rs 86.66% 1 Missing and 3 partials ⚠️
rust/lance/src/dataset/write/commit.rs 75.00% 1 Missing ⚠️
rust/lance/src/io/commit/conflict_resolver.rs 75.00% 1 Missing ⚠️
rust/lance/src/io/deletion.rs 50.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4257      +/-   ##
==========================================
- Coverage   81.13%   81.11%   -0.03%     
==========================================
  Files         308      308              
  Lines      113944   114296     +352     
  Branches   113944   114296     +352     
==========================================
+ Hits        92448    92706     +258     
- Misses      18238    18318      +80     
- Partials     3258     3272      +14     
Flag Coverage Δ
unittests 81.11% <71.61%> (-0.03%) ⬇️


☔ View full report in Codecov by Sentry.

@jackye1995
Contributor

This is actually different from how I was thinking about doing shallow clone. Looks like what we are doing here is adding a reference object so that the data and delete file paths are resolved using the reference, instead of using the table's root location.

I was thinking about instead of introducing a Clone operation and a Reference object in manifest, when we do shallow clone, we convert all the relative paths to absolute paths. And we fix the reader to allow reading absolute path instead of always resolving to a relative path. I think that is also how Delta does shallow clone: https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/commands/CloneTableBase.scala#L147-L151

Overall, my feeling is that using absolute paths is better, because with a reference it is unclear how we will resolve file paths after we write new fragments: some need to be resolved using the reference dataset, some need to use the current new dataset, and things get complicated quickly. Absolute paths will also be useful for other cases like importing files, so it's in general a good feature to have.

What do you think?

@jackye1995
Contributor

cc @wjones127 if you have any thoughts on this

@majin1102
Contributor Author

majin1102 commented Jul 18, 2025

I was thinking about instead of introducing a Clone operation and a Reference object in manifest, when we do shallow clone, we convert all the relative paths to absolute paths. And we fix the reader to allow reading absolute path instead of always resolving to a relative path. I think that is also how Delta does shallow clone: https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/commands/CloneTableBase.scala#L147-L151

Thanks for the delta input. I will look into it.

Let's break it into two parts:

Clone operation.

I think for now every manifest/version has a related transaction. What do we put in the cloned dataset as the initialized transaction, like Overwrite, if we don't have a clone operation? It would help if you could elaborate more on your thoughts. (I think this may be unrelated to the reference thing?)

Absolute path.

I did some work to use absolute paths in this case, and a few thoughts came to mind:

  1. When we talk about an absolute path, we actually use a dataset path to get a data directory and the filename. What if we just record the source dataset as an absolute path? This seems lighter and friendlier to governance tools. We still get absolute file paths from the reference eventually in this case.
  2. I was thinking: what if the cloned dataset is overwritten or totally updated? Then it becomes a normal dataset which has no actual relation to the source dataset. Somehow I thought having to use absolute paths here is some sort of limitation. Or let's say absolute paths could be an independent feature or property; it may not be necessary to bind it to this case.
  3. For the deletion file, there is not even a relative path; the file name is assembled. I would need to add a path to the spec. I think it will work, just a little heavy for this case. (Actually this is the first problem I met.)
  4. Later I tried to figure out if there are blocking points to making this reference happen. I found the logic is quite clear:
    • If fragment_id > reference.max_fragment_id, use the cloned dataset relative path. Otherwise, use the source dataset path.
    • If deletion_file.read_version >= reference.version, use the cloned dataset relative path. Otherwise, use the source dataset path.
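The two resolution conditions can be sketched as follows. This is a minimal sketch: `Reference`, its field names, and the helper functions are illustrative assumptions for this discussion, not the actual Lance API.

```rust
// Hypothetical types for illustration only; not the actual Lance structs.
struct Reference {
    source_path: String,
    version: u64,
    max_fragment_id: u64,
}

// Data files from fragments written after the clone live under the
// cloned dataset root; older fragments resolve against the source.
fn data_file_root(fragment_id: u64, cloned_root: &str, r: &Reference) -> String {
    if fragment_id > r.max_fragment_id {
        cloned_root.to_string()
    } else {
        r.source_path.clone()
    }
}

// Deletion files written at or after the cloned version live under the
// cloned dataset root; older ones resolve against the source.
fn deletion_file_root(read_version: u64, cloned_root: &str, r: &Reference) -> String {
    if read_version >= r.version {
        cloned_root.to_string()
    } else {
        r.source_path.clone()
    }
}

fn main() {
    let r = Reference {
        source_path: "s3://bucket/source".to_string(),
        version: 10,
        max_fragment_id: 100,
    };
    // Fragment 101 was written after the clone, fragment 42 before it.
    println!("{}", data_file_root(101, "s3://bucket/clone", &r));
    println!("{}", data_file_root(42, "s3://bucket/clone", &r));
    // A deletion file from before the cloned version resolves to the source.
    println!("{}", deletion_file_root(9, "s3://bucket/clone", &r));
}
```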

This is my mind map. I thought this reference was a quite Lance-specific thing, because of the fragment ids and the deletion file construction.

it is unclear how we will resolve file paths after we write new fragments: some need to be resolved using the reference dataset

For now I think the 'unclear' thing is a simple condition for both the data file and the deletion file. But I may be wrong or misunderstand something. I'm not insistent on this reference mechanism, but I think it will help if we can talk deeper into it and do some comparison.

Absolute paths will also be useful for other cases like importing files, so it's in general a good feature to have.

I think I need to learn more from scenarios like the import-files case you mentioned. Apparently I don't get the whole picture. Overall I'm willing to embrace using absolute paths in each data file, and I'm open to improving this prototype in any way.

@majin1102 majin1102 marked this pull request as draft July 18, 2025 20:56
@jackye1995
Contributor

I think for now every manifest/version has a related transaction. What do we put in the cloned dataset as the initialized transaction, like Overwrite, if we don't have a clone operation? It would help if you could elaborate more on your thoughts. (I think this may be unrelated to the reference thing?)

Yeah, my original thought was that it would just reflect the latest operation done to the original table. But I think you are right, it is better to just describe it as a clone from the source table.

what if the cloned dataset is overwritten or totally updated? Then it becomes a normal dataset which has no actual relation to the source dataset.

Yeah, I was only thinking of using absolute paths for the initial manifest content. If there are new fragments after new operations on the cloned table, then those will take relative paths, and that is not ambiguous because we always use the source location to derive the full path if it is relative.

For the deletion file, there is not even a relative path; the file name is assembled. I would need to add a path to the spec. I think it will work, just a little heavy for this case.

ah good point. I did not know that the delete file path is constructed instead of a relative path.

Seems like if there is some importance in using the constructed path for the delete file, then I agree using a reference is a good idea; the design looks like a good start. I think it needs to be a list of references instead of just one, because there can be situations like: I clone a table and then after a while I want to clone again, and then we will need to resolve file paths from multiple tables. And you might need to create some more actions like DropReference to fully remove a reference.

But it feels like hard-coding the delete file path could be limiting, if we want to do features like multi-tier storage and move files around tiers then it will become a blocker again. @wjones127 what were the considerations around this if you could share the historical context?

I think I need to learn more from scenarios like import files you metioned. Apparently I don't get the whole picture.

I am talking about features like https://iceberg.apache.org/docs/latest/spark-procedures/#add_files.

@majin1102
Contributor Author

majin1102 commented Jul 19, 2025

I think it needs to be a list of references instead of just one, because there can be situations like: I clone a table and then after a while I want to clone again, and then we will need to resolve file paths from multiple tables.

Yes I think this is a good point. In my mind this was a nested reference like:

Reference {
  source_path: xxx
  version: 10
  max_fragment_id: 100
  ......
  reference: {
    source_path: xxxxxx
    version: 5
    max_fragment_id: 70
    ......
  }
}

I think a reference list would be better

And you might need to create some more actions like DropReference to fully remove a reference

I thought a reference could be deleted automatically when the referred versions are cleaned up in the cloned dataset. Can you elaborate a little on the 'DropReference' scenario?

@majin1102
Contributor Author

I think a field id list should be included in the Reference struct for the case of merging columns:

[attached diagram: branch drawio]

We will use the source root if the data file's fragment id < reference.max_fragment_id and the data file's field ids are all included in the reference.
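That condition can be sketched like this. The struct and function names here are hypothetical, introduced only to illustrate the rule; they are not the actual Lance types.

```rust
use std::collections::HashSet;

// Hypothetical types for illustration; not the actual Lance structs.
struct Reference {
    max_fragment_id: u64,
    field_ids: HashSet<i32>, // field ids that existed in the source at clone time
}

// A data file resolves against the source root only if its fragment
// predates the clone and all of its field ids existed at clone time;
// newly merged columns are written under the cloned dataset.
fn use_source_root(fragment_id: u64, file_field_ids: &[i32], r: &Reference) -> bool {
    fragment_id < r.max_fragment_id
        && file_field_ids.iter().all(|id| r.field_ids.contains(id))
}

fn main() {
    let r = Reference {
        max_fragment_id: 100,
        field_ids: vec![1, 2, 3].into_iter().collect(),
    };
    // Old fragment with fields known at clone time -> source root.
    println!("{}", use_source_root(50, &[1, 2], &r));
    // Field 4 was merged in after the clone -> cloned dataset root.
    println!("{}", use_source_root(50, &[1, 4], &r));
}
```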

@jackye1995
Contributor

jackye1995 commented Jul 22, 2025

I chatted with Will and Weston recently; here is a summary:

Hard coded path

In general, for the long term, we should have a reference for all file paths instead of hard-coded ones. Originally, paths like the delete file path were hard-coded to save manifest size.

For now, regarding the shallow clone feature, we should aim for just copying the files with hard-coded paths to the cloned dataset. Over time, as we move more files to references, shallow clone will copy fewer files.

Relative paths for different bases

The easiest way to handle this is to just store absolute paths, but that could inflate the overall manifest size. The reference approach you introduced is one way to handle that; Weston suggested another, potentially simpler way:

For each path, we store the tuple of (base, path), where base is not the base path itself, but just a symbol. At the manifest level, we store a base-to-URI mapping to resolve it. For example, given the mapping {"other": "/tmp/bar"}, a path (other, data/fragment1.lance) will be resolved to /tmp/bar/data/fragment1.lance.

At the protobuf definition level, this probably means an optional base in places like

message DataFile {
  // Relative path to the base
  string path = 1;
  ...
  // optional base, if not the root table location
  optional string base = 7;
}

For cloning/branching, we can rewrite the manifest to add a base for the paths in the manifest.
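The (base, path) resolution described above could look something like the sketch below. The function and parameter names are assumptions for illustration, not the Lance API.

```rust
use std::collections::HashMap;

// Minimal sketch of (base, path) resolution against a manifest-level
// base-to-URI mapping; names are hypothetical, not the Lance API.
fn resolve_path(
    table_root: &str,
    base_paths: &HashMap<String, String>,
    base: Option<&str>,
    path: &str,
) -> String {
    // No base symbol: resolve against the table root, as today.
    // Otherwise, look the symbol up in the manifest's base mapping.
    let root = match base {
        None => table_root,
        Some(b) => base_paths.get(b).map(String::as_str).unwrap_or(table_root),
    };
    format!("{}/{}", root.trim_end_matches('/'), path)
}

fn main() {
    let mut base_paths = HashMap::new();
    base_paths.insert("other".to_string(), "/tmp/bar".to_string());
    // Matches the example above: (other, data/fragment1.lance)
    // resolves to /tmp/bar/data/fragment1.lance.
    println!(
        "{}",
        resolve_path("/tmp/foo", &base_paths, Some("other"), "data/fragment1.lance")
    );
}
```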

Other optimizations

We should also add compression to the manifest to keep the manifest size in control. Currently protobuf does very little for string compression.

@majin1102 please let us know what you think, if you agree with these ideas or have any other suggestions!

@majin1102
Contributor Author

majin1102 commented Jul 23, 2025

This approach is highly impressive to me; it seems to cover everything:

  • the manifest size issue
  • reference complexity
  • extension for importing files
  • potentially, the base list could be modified easily if we have migration scenarios

I think some details could be discussed:

  1. The initial version number of the cloned dataset. For example, if we clone version 3, will the cloned dataset be initialized at version 3 or at 1? I think both ways are reasonable. I picked 3 in the prototype, for a clear relationship with the source dataset.
  2. I think this approach leaves the reference as only a string for a base path. The new definition as I understand it is like:
    optional repeated string base_paths
    Do we agree? @jackye1995 @westonpace
  3. Do we need a deep clone as mentioned in From shallow clone to branching #4256? (Just copy the target snapshot using the object_store API.) I think that approach is quite useful for users who can afford the storage cost. WDYT @jackye1995

@github-actions github-actions Bot added the python label Aug 4, 2025
@majin1102 majin1102 changed the title feat: support shallow_clone in dataset [WIP] feat: support shallow_clone in dataset Aug 4, 2025
@github-actions
Contributor

github-actions Bot commented Aug 4, 2025

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@majin1102 majin1102 marked this pull request as draft August 19, 2025 08:19
@majin1102 majin1102 marked this pull request as ready for review August 20, 2025 17:26
@majin1102
Contributor Author

This is ready for another review. @jackye1995

Contributor

@jackye1995 jackye1995 left a comment


I left some minor nit comments, but apart from that this looks good to me!

long readVersion,
Long numDeletedRows,
DeletionFileType fileType,
Integer pathBaseIndex) {
Contributor


nit: variable name not updated, should be baseId

Comment thread protos/transaction.proto
// - true: Performs a metadata-only clone (copies manifest without data files).
// The cloned dataset references original data through `base_paths`,
// suitable for experimental scenarios or rapid metadata migration.
// - false: Performs a full deep clone using the underlying object storage's native
Contributor

@jackye1995 jackye1995 Aug 21, 2025


if we actually do deep clone like what is described here, can we still commit the cloned dataset with this operation? It seems like we cannot, because we are just bulk-copying the directory to somewhere else?

Contributor Author


In my mind, we should copy each data file at the target version instead of bulk-copying the directory, which brings some benefits:

  1. Compared to copying the whole directory, we focus only on the files at the target version, which makes the cloning operation somewhat lightweight.
  2. Compared to read-and-write, we don't need to load Arrow data, so this should be faster.
  3. Moreover, I think we could provide some properties to configure whether to include multiple versions or indexes, which would make this function more flexible.

What do you think?

Contributor


Today many users actually do directly copy a Lance dataset to the cloud. This is especially popular for ML scientists who build the initial dataset locally, then copy it and hand it over to a team to bulk load the rest. I think what you say makes sense. Maybe, if the storage supports it, we can also copy the entire table directory except the version directory and then commit the deep-cloned manifest.

Regarding configs to include indexes, technically that could be an option, but I don't know why we would want no index if we support it.

For multiple versions, I am not so sure what the use case would be. Especially since we consider the commit history to be linear, it's hard to comprehend what it means to have multiple Clone transactions on top of each other.

But overall I think we don't need to hang on too long here, we can discuss these more in the Discussion thread.

Comment thread rust/lance-table/src/format/fragment.rs Outdated
}

impl DataFile {
pub fn refer(datafile: &Self, base_id: u32) -> Self {
Contributor


nit: it is not clear to me that refer actually means "refer to a base ID", can we use a more explicit function name like with_base_id?

Contributor Author


Actually this is not used anymore. Now we clone the datafiles and directly set the value:

    let cloned_fragments = self
        .fragments
        .as_ref()
        .iter()
        .map(|fragment| {
            // Clone the fragment, then point its data files and its
            // deletion file (if any) at the new base.
            let mut cloned_fragment = fragment.clone();
            cloned_fragment.files = cloned_fragment
                .files
                .into_iter()
                .map(|mut file| {
                    file.base_id = Some(new_base_id);
                    file
                })
                .collect();

            if let Some(mut deletion) = cloned_fragment.deletion_file.take() {
                deletion.base_id = Some(new_base_id);
                cloned_fragment.deletion_file = Some(deletion);
            }

            cloned_fragment
        })
        .collect::<Vec<_>>();

I will delete it

Comment thread protos/table.proto Outdated
// Flag indicating whether this path is a dataset root path or file directory:
// - true: Path is a dataset root (actual files under subdirectories like `data`, `_deletions`)
// - false: Path is a direct file directory (scenario like importing files)
bool dataset_base = 3;
Contributor


this name feels a bit confusing, because there is a "base path", and then this "dataset base" is a boolean, and also we don't really have a concept of "dataset base" so far, we have been calling it a "database root". Can we name it something like "is_dataset_root"?

Contributor Author


I must say "is_dataset_root" flashed through my mind. I don't recall why I didn't choose it.

Comment thread rust/lance/src/dataset.rs
)
})?;

Ok(Path::parse(base_path.path.as_str())?)
Contributor


we should add a check that the BasePath must be a dataset root.

Comment thread rust/lance/src/dataset.rs Outdated
ref_name: &str,
store_params: ObjectStoreParams,
) -> Result<Self> {
// self.tags.create(ref_name, self.version().version).await?;
Contributor


Remove this line? Or I think we can have an optional create_tag flag which runs this when true.

Contributor Author

@majin1102 majin1102 Aug 21, 2025


Yeah, I think this is a to-be-discussed item left behind.

Do you think we should bind shallow_clone to tags? Or we could directly shallow_clone from a u32 version (make the parameter an impl Into<refs::Ref> like checkout_version).

Of course the lifetime of the files is not guaranteed. I think this is an authorization issue: normally the user doing a shallow clone is a reader on the source dataset, and shallow_clone itself is like a read operation on the source. Binding it to tags means this operation must depend on a tag write/manage operation, which is not that reasonable. Consider the case where users only want to do a very short experiment on the latest version; I think it could be independent of tags.

Also, IMO auto-creating a tag in shallow_clone is not very reasonable according to the least-privilege principle.

What do you think? This would make ref_name an Option<String>, which quite makes sense to me @jackye1995

Contributor


Sounds good to me, agree separating it makes more sense

Comment thread rust/lance/src/dataset.rs
Ok(())
}

pub async fn shallow_clone(
Contributor


we should add documentation given this is a public API

@jackye1995
Contributor

Oh, one more thing I forgot to mention: I think for this feature we need to update the reader_feature_flags and writer_feature_flags accordingly. We should not let an old reader read a cloned dataset, since that will cause correctness issues (most likely it will fail). We should also not let an old writer write to a cloned dataset, since that would revert the base paths set in the manifest. I am okay with doing that in this PR or separately, up to you.

@majin1102
Contributor Author

majin1102 commented Aug 21, 2025

Newest update: addressed the comments, with two discussions left open:

  1. deep_clone description
  2. tag decoupling

We should also not let an old writer write to a cloned dataset, since that would revert the base paths set in the manifest. I am okay with doing that in this PR or separately, up to you.

Then let me raise another PR to do this (immediately after this PR). Thank you

@majin1102
Contributor Author

Sounds good to me, agree separating it makes more sense

Please take another look when you have time

@jackye1995
Contributor

Thanks for all the work, looking forward to the next steps!!

@jackye1995 jackye1995 merged commit 746d2dd into lance-format:main Aug 23, 2025
26 checks passed
@majin1102 majin1102 deleted the shallow-clone branch August 24, 2025 04:58
jackye1995 added a commit that referenced this pull request Aug 28, 2025
Context:

#4257 (comment)

---------

Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
jackye1995 added a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
Context:

lance-format#4257 (comment)

---------

Co-authored-by: Jack Ye <yezhaoqin@gmail.com>