feat: dataset supports deep_clone #5250
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Force-pushed from be7b215 to 2ff4807.
Codecov Report: ❌ Patch coverage is …
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
This PR is ready for review, @jackye1995. Please take a look if you have time. Thanks!
jackye1995 left a comment:
Thanks for working on this. I added some comments; can we also add some tests?
```rust
// Resolve source dataset and its manifest using checkout_version
let src_ds = self.checkout_version(version).await?;
let path_specs = self.collect_paths().await?;
```
```rust
/// A file path wrapper that can be used to represent a file in a dataset.
/// This wrapper is used for changing the base_path, e.g. in deep_clone.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FilePath {
```
Can we just not have this for now? I agree we probably need some refactoring to have this, but it should be in an independent PR that covers all cases.
```rust
    vec![]
};
(new_manifest, updated_indices)
```

```rust
// Deep clone: build a manifest that references local files (no external bases)
```
I just realized one thing: we also need to set base_id for ExternalFile, so those files can be handled in the shallow clone case, and also not inherit base_id in the deep clone case.
I did realize that. But I searched the code and asked @yanghua; he told me this has not been used in any write paths yet. And I don't know what the path looks like or how it should be copied, so I wrote a check instead.
Maybe you have more context to share?
Another place I can think of is https://github.com/lance-format/lance/blob/main/rust/lance/src/index/frag_reuse.rs#L162
For stable row id, we are not doing it today, but from a spec-correctness perspective I think we should probably handle it here. It can be a separate PR though; I'm fine with that.
Yeah, I think we can handle external files in follow-up PRs.
```rust
let io_parallelism = self.object_store.io_parallelism();
let copy_futures = path_specs
```
I guess this is the best we can do now, but all cloud storage providers have batch copy, and we should ideally leverage that; it requires upstream support, though. Maybe create a GitHub issue to track that and add a TODO here.
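For context, the per-object copy loop being discussed can be sketched with bounded parallelism. This is a minimal, hypothetical std-only illustration (the actual PR uses the async object store and its `io_parallelism`; `copy_paths`, `src_root`, and `dst_root` are illustrative names, not the PR's API):

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::thread;

/// Copy each relative path from `src_root` to `dst_root`, running at most
/// `parallelism` copies at a time. A hypothetical sketch of the idea behind
/// the PR's parallel copy stream, using std::fs instead of an object store.
fn copy_paths(
    src_root: &Path,
    dst_root: &Path,
    paths: &[PathBuf],
    parallelism: usize,
) -> io::Result<()> {
    for chunk in paths.chunks(parallelism.max(1)) {
        // Spawn one thread per path in this chunk, then join them all
        // before starting the next chunk, bounding in-flight copies.
        let handles: Vec<_> = chunk
            .iter()
            .map(|rel| {
                let src = src_root.join(rel);
                let dst = dst_root.join(rel);
                thread::spawn(move || -> io::Result<u64> {
                    if let Some(parent) = dst.parent() {
                        fs::create_dir_all(parent)?;
                    }
                    fs::copy(&src, &dst)
                })
            })
            .collect();
        for h in handles {
            h.join().expect("copy thread panicked")?;
        }
    }
    Ok(())
}
```

A real implementation would use the object store's async copy with `buffer_unordered`-style concurrency rather than OS threads, which is why server-side batch copy (where the provider supports it) would be a further improvement.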
Oops!
Ready for another look, @jackye1995.
jackye1995 left a comment:
Overall this is looking good to me; the remaining items can be handled in follow-up PRs.
Code Review: feat: dataset supports deep_clone

Summary

This PR adds a deep_clone method to perform a full server-side copy of all dataset files to a new location. The implementation reuses the existing Operation::Clone infrastructure with an is_shallow: false flag.

P0/P1 Issues

P0: Potential data loss on partial copy failure
In deep_clone(), if any file copy fails midway through the parallel copy stream, partial files may be left at the target location with no cleanup. The error is returned, but orphan files remain. Recommendation: consider adding cleanup logic on failure, or document this behavior clearly so users know to clean up failed clones manually.

P1: &mut self signature may be overly restrictive
The deep_clone method takes &mut self but does not appear to mutate self; it calls self.checkout_version(), which only requires &self. This unnecessarily prevents concurrent deep_clone operations.

P1: External row_id_meta returns an internal error rather than failing gracefully as unsupported
The collect_paths method returns an internal error for external row_id files. This should probably be Error::NotSupported rather than Error::Internal, since it is a user-facing limitation rather than an internal invariant violation.

Minor Observations (non-blocking)
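One possible shape for the P0 cleanup suggestion above: track successfully copied objects and best-effort delete them when a later copy fails, so no orphan files remain at the target. This is a hypothetical std::fs sketch, not the PR's code (the real implementation would work against the async object store):

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Copy (src, dst) pairs sequentially; on the first failure, best-effort
/// delete everything already copied so no orphan files remain at the target.
/// Hypothetical sketch of the cleanup-on-failure recommendation.
fn copy_all_or_cleanup(pairs: &[(PathBuf, PathBuf)]) -> io::Result<()> {
    let mut copied: Vec<&PathBuf> = Vec::new();
    for (src, dst) in pairs {
        match fs::copy(src, dst) {
            Ok(_) => copied.push(dst),
            Err(e) => {
                // Roll back: errors during cleanup are deliberately ignored,
                // since we want to surface the original copy error.
                for p in copied {
                    let _ = fs::remove_file(p);
                }
                return Err(e);
            }
        }
    }
    Ok(())
}
```

With parallel copies the same idea applies, but the rollback must wait for all in-flight copies to settle before deleting.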
PR Review: feat: dataset supports deep_clone

Summary

This PR adds a deep_clone method.

P0 Issues

1. No cleanup on partial copy failure (data corruption risk)
In … More critically, if the copy phase succeeds but the … Suggested fix: either (a) implement cleanup on failure, (b) check for file existence rather than manifest existence when detecting existing targets, or (c) document this behavior and provide a cleanup utility.

P1 Issues

2. Missing test for cross-storage deep_clone
The test only covers local filesystem cloning. Since …

3. …

```rust
pub async fn deep_clone(
    &mut self, // <-- This is surprising
    target_path: &str,
    ...
```

The method only reads from …

Minor Observations (non-blocking)
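The `&mut self` concern can be seen with a toy type: a read-only method taking `&self` can be called concurrently through a shared `Arc`, which a `&mut self` signature would reject at compile time. This is a hypothetical sketch; `Dataset` and `deep_clone_to` here are illustrative stand-ins, not the PR's actual types:

```rust
use std::sync::Arc;
use std::thread;

// Toy stand-in for the dataset type; names are illustrative only.
struct Dataset {
    uri: String,
}

impl Dataset {
    // Taking &self: the method only reads, so shared references suffice
    // and concurrent clones of the same dataset are allowed.
    fn deep_clone_to(&self, target: &str) -> String {
        format!("cloned {} -> {}", self.uri, target)
    }
}

fn concurrent_clones() -> Vec<String> {
    let ds = Arc::new(Dataset { uri: "s3://bucket/src".to_string() });
    // Two concurrent clones through the same Arc; this would not compile
    // if deep_clone_to required &mut self.
    let handles: Vec<_> = ["t1", "t2"]
        .iter()
        .map(|t| {
            let ds = Arc::clone(&ds);
            let t = t.to_string();
            thread::spawn(move || ds.deep_clone_to(&t))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Changing the signature to `&self` would therefore be a small, backward-compatible relaxation, assuming nothing in the method actually mutates the dataset.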
Closes #5249