feat: dataset supports deep_clone#5250

Merged
majin1102 merged 12 commits into lance-format:main from majin1102:deep_clone
Dec 22, 2025

@majin1102
Contributor

@majin1102 majin1102 commented Nov 17, 2025

Close #5249

@github-actions github-actions Bot added the enhancement New feature or request label Nov 17, 2025
@majin1102 majin1102 marked this pull request as draft November 17, 2025 04:54

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment thread rust/lance-table/src/format/fragment.rs Outdated
@codecov

codecov Bot commented Dec 3, 2025

Codecov Report

❌ Patch coverage is 75.36232% with 51 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/dataset.rs | 70.00% | 30 Missing and 12 partials ⚠️ |
| rust/lance/src/io/commit.rs | 84.74% | 1 Missing and 8 partials ⚠️ |

@majin1102 majin1102 marked this pull request as ready for review December 3, 2025 12:26

@majin1102
Contributor Author

This PR is ready for review @jackye1995

Please take a look if you have time. Thanks

Contributor

@jackye1995 jackye1995 left a comment

thanks for working on this, added some comments, and can we also add some tests?

Comment thread rust/lance/src/dataset.rs Outdated

// Resolve source dataset and its manifest using checkout_version
let src_ds = self.checkout_version(version).await?;
let path_specs = self.collect_paths().await?;
Contributor

this should use src_ds?
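
To illustrate the point of this comment, here is a minimal sketch of why the paths should be collected from the checked-out dataset rather than from `self`. The `Dataset` struct, its `history` field, and `paths_to_clone` are hypothetical stand-ins, and the methods are synchronous for brevity; only the names `checkout_version`, `collect_paths`, and `src_ds` come from the diff.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified dataset: one file list per version.
#[derive(Clone)]
pub struct Dataset {
    pub version: u64,
    pub history: BTreeMap<u64, Vec<String>>,
}

impl Dataset {
    /// Return a view of the dataset at `version`, if that version exists.
    pub fn checkout_version(&self, version: u64) -> Option<Dataset> {
        self.history.contains_key(&version).then(|| Dataset {
            version,
            history: self.history.clone(),
        })
    }

    /// List the files referenced by the version this handle points at.
    pub fn collect_paths(&self) -> Vec<String> {
        self.history.get(&self.version).cloned().unwrap_or_default()
    }

    /// Collect paths from the checked-out source, not from `self`,
    /// so that cloning an older version does not pick up newer files.
    pub fn paths_to_clone(&self, version: u64) -> Option<Vec<String>> {
        let src_ds = self.checkout_version(version)?;
        Some(src_ds.collect_paths())
    }
}
```

In this sketch, calling `collect_paths` on a handle at version 2 would return the version 2 files even when version 1 was requested, which is the mismatch the comment points out.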

Comment thread rust/lance/src/dataset.rs Outdated
/// A file path wrapper that can be used to represent a file in a dataset.
/// This wrapper is used for changing the base_path like deep_clone
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FilePath {
Contributor

can we just not have this for now? I agree we probably need some refactoring to have this, but it should be in an independent PR that covers all cases.

vec![]
};
(new_manifest, updated_indices)
// Deep clone: build a manifest that references local files (no external bases)
Contributor

I just realized one thing: we also need to set base_id for ExternalFile, so that those files can be handled in the shallow clone case and do not inherit base_id in the deep clone case.

Contributor Author

@majin1102 majin1102 Dec 8, 2025

I did realize that. But I searched the code and asked @yanghua, who told me this has not been used in any write paths yet. I also don't know what the path looks like or how it should be copied, so I wrote a check:

https://github.com/majin1102/lance/blob/3823f466a5daa6d9e4eed1ddaa00bb4cc95bd705/rust/lance/src/dataset.rs#L2273

Maybe you have more context to give me?

Contributor

Another place I can think of is https://github.com/lance-format/lance/blob/main/rust/lance/src/index/frag_reuse.rs#L162

For stable row IDs, we are not doing it today, but from a spec-correctness perspective I think we should probably handle it here. It can be a separate PR, though; I'm fine with that.

Collaborator

Yeah, I think we can handle external files in further PRs.

Comment thread rust/lance/src/dataset.rs Outdated
}

let io_parallelism = self.object_store.io_parallelism();
let copy_futures = path_specs
Contributor

I guess this is the best we can do for now. All cloud storage services have a batch copy API that we should ideally leverage, but that requires upstream support. Maybe create a GitHub issue to track that and add a TODO here.
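
For reference, the per-file copy with bounded concurrency discussed here can be sketched as follows. This is a std-only approximation under stated assumptions: it uses worker threads and the local filesystem instead of the async stream and object store used in the PR, and `copy_all` and its `parallelism` argument are hypothetical names.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;
use std::sync::{Arc, Mutex};
use std::thread;

/// Copy every (source, target) pair using at most `parallelism` worker threads.
/// Returns the number of files copied, or the first error seen.
pub fn copy_all(pairs: Vec<(PathBuf, PathBuf)>, parallelism: usize) -> io::Result<usize> {
    let queue = Arc::new(Mutex::new(pairs));
    let workers: Vec<_> = (0..parallelism.max(1))
        .map(|_| {
            let queue = Arc::clone(&queue);
            thread::spawn(move || -> io::Result<usize> {
                let mut copied = 0;
                loop {
                    // Take the next job; the lock is released before the copy starts.
                    let job = queue.lock().unwrap().pop();
                    let Some((src, dst)) = job else { break };
                    if let Some(parent) = dst.parent() {
                        fs::create_dir_all(parent)?;
                    }
                    fs::copy(&src, &dst)?;
                    copied += 1;
                }
                Ok(copied)
            })
        })
        .collect();

    // Wait for every worker before reporting, so no copy is still running on return.
    let mut total = 0;
    let mut first_err = None;
    for worker in workers {
        match worker.join().expect("copy worker panicked") {
            Ok(n) => total += n,
            Err(e) => {
                first_err.get_or_insert(e);
            }
        }
    }
    match first_err {
        Some(e) => Err(e),
        None => Ok(total),
    }
}
```

A batch copy API, where the store offers one, would replace the per-file loop with a single request; that part depends on upstream support and is not sketched here.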

@majin1102
Contributor Author

majin1102 commented Dec 8, 2025

> thanks for working on this, added some comments, and can we also add some tests?

Oops!
The dataset tests were relocated and I forgot to merge this test_deep_clone when I rebased. Sorry for that!

@majin1102
Contributor Author

Ready for another look @jackye1995

Contributor

@jackye1995 jackye1995 left a comment

Overall this looks good to me. There are some follow-ups, which I think we can handle in follow-up PRs.

@github-actions
Contributor

Code Review: feat: dataset supports deep_clone

Summary

This PR adds a deep_clone method to perform a full server-side copy of all dataset files to a new location. The implementation reuses the existing Operation::Clone infrastructure with an is_shallow: false flag.

P0/P1 Issues

P0: Potential data loss on partial copy failure

In deep_clone(), if any file copy fails midway through the parallel copy stream, partial files may be left at the target location with no cleanup. The error is returned but orphan files remain.

Recommendation: Consider adding cleanup logic on failure, or document this behavior clearly so users know to clean up failed clones manually.
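
One possible shape for such cleanup logic, as a minimal sketch only: it works on the local filesystem with sequential copies, and `copy_one` and `copy_with_rollback` are hypothetical helpers, not functions from the PR or from Lance.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Copy a single file, creating the target's parent directory if needed.
fn copy_one(src: &Path, dst: &Path) -> io::Result<()> {
    if let Some(parent) = dst.parent() {
        fs::create_dir_all(parent)?;
    }
    fs::copy(src, dst)?;
    Ok(())
}

/// Copy each (source, target) pair. If any copy fails, remove every target
/// attempted so far (including a possibly partial one) and return the error.
pub fn copy_with_rollback(pairs: &[(PathBuf, PathBuf)]) -> io::Result<()> {
    let mut attempted: Vec<&Path> = Vec::new();
    for (src, dst) in pairs {
        // Record the target before copying so a half-written file is also cleaned up.
        attempted.push(dst.as_path());
        if let Err(err) = copy_one(src, dst) {
            for path in attempted {
                // Best effort: a target that was never created is simply skipped.
                let _ = fs::remove_file(path);
            }
            return Err(err);
        }
    }
    Ok(())
}
```

This sketch leaves empty directories behind and assumes the targets did not exist before the call; an object-store version would need to decide how to treat both cases.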


P1: &mut self signature may be overly restrictive

The deep_clone method takes &mut self but does not appear to mutate self - it calls self.checkout_version() which only requires &self. This prevents concurrent deep_clone operations unnecessarily.


P1: External row_id_meta returns error rather than being unsupported gracefully

The collect_paths method returns an internal error for external row_id files. This should probably be Error::NotSupported rather than Error::Internal since it is a user-facing limitation, not an internal invariant violation.
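
The distinction can be sketched as below. The `CloneError` and `RowIdMeta` types and the `check_row_id_meta` function are simplified, hypothetical stand-ins used only to show the idea; they are not Lance's actual `Error` or row ID types.

```rust
/// Hypothetical error type with the two variants under discussion.
#[derive(Debug, PartialEq)]
pub enum CloneError {
    /// A known limitation the caller ran into.
    NotSupported(String),
    /// A broken internal invariant; callers cannot act on it.
    Internal(String),
}

/// Hypothetical row ID metadata: stored inline or in an external file.
pub enum RowIdMeta {
    Inline(Vec<u8>),
    External { path: String },
}

/// Reject external row ID files with a user-facing "not supported" error.
pub fn check_row_id_meta(meta: &RowIdMeta) -> Result<(), CloneError> {
    match meta {
        RowIdMeta::Inline(_) => Ok(()),
        RowIdMeta::External { path } => Err(CloneError::NotSupported(format!(
            "deep_clone does not support external row id file: {path}"
        ))),
    }
}
```

Callers can then match on the variant and report a limitation instead of treating the failure as a bug.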


Minor Observations (non-blocking)

  1. The test correctly validates index copying, deletion files, and error cases. Good coverage.

  2. The TODO comment about bulk copy APIs (issue #5435, "Leverage object store bulk copy to efficiently deep_clone dataset") is appropriate - this is the right place to optimize in the future.

  3. The manifest handling in do_commit_new_dataset properly clears base_paths, branch, tag, and resets base_id values for deep clone, ensuring the cloned dataset is truly independent.
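
The manifest reset described in the last observation can be sketched roughly as follows. The `Manifest` and `DataFile` structs here are simplified, hypothetical models with assumed field types, and `detach_for_deep_clone` is an illustrative name; the actual logic lives in `do_commit_new_dataset`.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified data file entry.
#[derive(Debug, Clone, Default)]
pub struct DataFile {
    pub path: String,
    /// Index into `Manifest::base_paths`, or `None` for a local file.
    pub base_id: Option<u32>,
}

/// Hypothetical, simplified manifest.
#[derive(Debug, Clone, Default)]
pub struct Manifest {
    pub base_paths: HashMap<u32, String>,
    pub branch: Option<String>,
    pub tag: Option<String>,
    pub files: Vec<DataFile>,
}

/// Make the manifest self-contained: drop external bases, branch, and tag,
/// and point every file at the dataset's own location.
pub fn detach_for_deep_clone(mut manifest: Manifest) -> Manifest {
    manifest.base_paths.clear();
    manifest.branch = None;
    manifest.tag = None;
    for file in &mut manifest.files {
        file.base_id = None;
    }
    manifest
}
```

Relative file paths are left untouched in this sketch, on the assumption that the copied files keep the same layout under the new root.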

@github-actions
Contributor

PR Review: feat: dataset supports deep_clone

Summary

This PR adds deep_clone functionality to copy all dataset files (data, indices, deletions) to a new location, creating a fully independent dataset. The implementation leverages the existing Operation::Clone infrastructure with is_shallow: false.

P0 Issues

1. No cleanup on partial copy failure (data corruption risk)

In deep_clone() (rust/lance/src/dataset.rs:2196-2213), if any file copy fails mid-operation, the target location will contain partially copied files but the commit will not proceed. However, if a subsequent deep_clone is attempted to the same location, it will fail with DatasetAlreadyExists (line 2188) because some files exist, even though no manifest was committed. This creates an orphaned, inconsistent state.

More critically, if the copy phase succeeds but the builder.execute(txn) fails (e.g., due to a transient error), the target will have all data files but no valid manifest, leaving an unusable dataset that cannot be recovered or cleaned up.

Suggested fix: Either (a) implement cleanup on failure, or (b) check for file existence rather than manifest existence when detecting existing targets, or (c) document this behavior and provide a cleanup utility.


P1 Issues

2. Missing test for cross-storage deep_clone

The test only covers local filesystem cloning. Since deep_clone takes store_params for the target and relies on store.copy() which may behave differently across object stores (S3, GCS, Azure), integration tests for cross-storage scenarios would increase confidence.

3. &mut self signature is unexpected for read-like operation

pub async fn deep_clone(
    &mut self,  // <-- This is surprising
    target_path: &str,
    ...

The method only reads from self (via checkout_version) and doesn't mutate the source dataset. Taking &mut self unnecessarily restricts usage. Consider changing to &self.
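
A small sketch of what `&self` would allow: several clones started from one shared borrow at the same time. The `Dataset` struct, `describe_clone`, and `clone_to_many` are hypothetical stand-ins (synchronous and using scoped threads rather than async tasks); with a `&mut self` receiver the same code would be rejected by the borrow checker.

```rust
use std::thread;

/// Hypothetical, simplified source dataset.
pub struct Dataset {
    pub uri: String,
    pub version: u64,
}

impl Dataset {
    /// Stand-in for `deep_clone`: it only reads from `self`.
    pub fn describe_clone(&self, target: &str) -> String {
        format!("{}@v{} -> {}", self.uri, self.version, target)
    }
}

/// Start one clone per target concurrently, all sharing the same `&Dataset`.
pub fn clone_to_many(ds: &Dataset, targets: &[&str]) -> Vec<String> {
    thread::scope(|scope| {
        // Spawn every clone first, then join in order so results match `targets`.
        let handles: Vec<_> = targets
            .iter()
            .copied()
            .map(|target| scope.spawn(move || ds.describe_clone(target)))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("clone thread panicked"))
            .collect()
    })
}
```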


Minor Observations (non-blocking)

@majin1102 majin1102 merged commit a7741f9 into lance-format:main Dec 22, 2025
29 checks passed

Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

Support deep_clone like Delta Lake

3 participants