feat: support cleanup across branches#5009
Conversation
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
7d96221 to
1ce78c0
Compare
381e49a to
79d98a9
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
Something goes wrong after rebasing. I will fix it tomorrow |
|
Add more tests for cornor cases. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
0cfed2b to
58fd8fc
Compare
jackye1995
left a comment
There was a problem hiding this comment.
Thanks for working on this! Some comments added
| } else { | ||
| return Err(Error::Internal { | ||
| message: format!( | ||
| "Branch {} is not referenced by any version from {}", |
There was a problem hiding this comment.
in that case, I think we should cleanup the branch because it is orphan/invalid? Since having a reference from a source branch is required. If that source version is already deleted, this branch should be deleted as well.
| let num_old_manifests = old_manifests.len(); | ||
|
|
||
| // Ideally this collect shouldn't be needed here but it seems necessary | ||
| // Ideally this collect shouldn't be needed here but it sseems necessary |
| pub children: BTreeSet<BranchLineage>, | ||
| } | ||
|
|
||
| impl BranchLineage { |
There was a problem hiding this comment.
I am wondering if this is an over-complication to solve the problem. in my mind, there are 2 things we need:
- get a post-order for the branches to cleanup
- then for each branch, cleanup like a normal dataset. But also we need to pass in the tagged and branch head versions in this dataset.
I don't think we need a full BranchLineage concept to do this. This is an example I requested claude to write, just for reference, but I think it should work:
/// Get cleanup order (post-order) for branches descending from `root`
fn get_descendants_post_order(
branches: &HashMap<String, BranchContents>,
root: Option<&str>,
) -> Vec<String> {
let mut result = Vec::new();
let mut stack = vec![root.map(String::from)];
while let Some(current) = stack.pop() {
// Find children of current
let children: Vec<_> = branches.iter()
.filter(|(_, c)| c.parent_branch == current)
.map(|(name, _)| name.clone())
.collect();
if children.is_empty() {
if let Some(name) = current {
result.push(name);
}
} else {
// Re-push current, then push children
stack.push(current);
for child in children {
stack.push(Some(child));
}
}
}
result
} /// Get versions pinned by child branches
fn get_pinned_versions(
branches: &HashMap<String, BranchContents>,
branch: Option<&str>,
) -> HashSet<u64> {
branches.values()
.filter(|c| c.parent_branch.as_deref() == branch)
.map(|c| c.parent_version)
.collect()
}What do you think?
There was a problem hiding this comment.
I don't think we need a full BranchLineage concept to do this. This is an example I requested claude to write, just for reference, but I think it should work
After re-thinking this through, I use a global branch identifier to make things work.
- Branch Identifier is global identical even if it's deleted unlike branch name. This could avoid id traveling if we delete a branch and create a new one using the same name.
- Use sort or reverse_sort to pre-order or post-order traverse child branches, making traversing simple.
- Potentially we could use branch_id and version_number to globally identifier a snapshot.
- Potentially if we use branch_id to generate the branch path(just use the last uuid, like dataset/tree/uuid), we could safely delete branch and create a new one with the same name(might introduce break change, not implemented yet).
| /// - `children`: ordered set for deterministic traversal. | ||
| #[derive(Debug)] | ||
| pub struct BranchLineage { | ||
| pub deleted: bool, |
There was a problem hiding this comment.
should we really support this? I think because we track branch root instead of branch head, we should enforce that child branch is deleted before parent branch. I understand we want the experience to be similar to git, but git tracks branch by head not by root, and there is no data dependency between child and parent, so it is much more feasible in git than in our case. Even if we want to support this, I feel it should be a much later iteration of the feature after we make the basic cases working.
There was a problem hiding this comment.
I agree that we should as least provide the enforcement in delete_branch. I reveal my thoughts as below.
I understand we want the experience to be similar to git, but git tracks branch by head not by root, and there is no data dependency between child and parent
This is a good top thinking. Let me try to elaborate further on this.
For Git:
- Git uses a
headto track a branch - Potentially the branch lineage relies on the
parent pointerin each commit version - All commit version metadata would not delete(as far as I know). The branch lineage would not lose when we delete a branch.
Iceberg uses similar mechnisim to manage branches. Let's say we have main, branch1 from main, branch2 from branch1. When we delete branch1, the snapshots in branch1 which are not inherited by branch2 would become unreference and be cleaned up in ExpireSnapshots. That says Iceberg supports delete a middle branch and could deal with cleanup correctly.
For Lance:
- Lance uses branch_name:version_number to track a branch
- Potentially the branch lineage relies on the branch root mapping and version sequencial. This drives to Branch Identifier
- Delete branch would delete both branch data and metadata
I think the reasonable action for deleting a branch to make deletion work and keep data available is:
- delete the
BranchContentsfile at first (this is the last operation for now) - Prepare a particular CleanupPolicy for the deleted branch dataset. The policy would only retain the referenced files
- Run the cleanup.
- If the cleanup failed. It could be re-run by a parent branch cleanup by setting clean_referenced_branch true
For this PR, I think it's fine to just throw error when deleting branches with children. I think the deletion action above could be resolved after(or if) we use UUID to generate the branch path to avoid id traveling and name occuping issue(honestly let me look back I wouldn't prefer the branch name to build path although it seems more friendly to oss users and could reuse dataset commit to provide atomicity). On the other hand, introducing BranchIdentifier could be a good chance to add backward and forward compatibility—if we decide to go this route. I’d love to hear your thoughts on this @jackye1995
This will introduce format change. We should remain cautious. I’ll organize my thoughts if necessary, but overall, after this PR, the feature has reached a production-ready state—except for some limitations around branch deletion.
|
Some discussions above. Please take a look with the newest code. |
1185f6b to
1be7302
Compare
Code Review SummaryThis PR implements branch-aware cleanup for Lance datasets, adding the ability to properly track file references across branches and prevent deletion of files still in use by child branches. P0/P1 Issues1. Breaking Change in Branch Deletion (P0) The This is a breaking behavior change that should be clearly documented. Users who relied on being able to delete parent branches will now get errors. Consider:
2. Potential Data Race in In 3. Backwards Compatibility for The pub fn null_branch_identifier() -> BranchIdentifier {
BranchIdentifier::null()
}This means legacy branches without an identifier field will get different random UUIDs on each deserialization. This could cause:
Consider using a deterministic sentinel UUID (e.g., all zeros) for null identifiers instead of random UUIDs. Minor Notes
Reviewed with Claude Code |
PR Review: feat: support cleanup across branchesThanks for working on this feature to support file lifetime management across branches. This is a complex change that adds significant new functionality. Here are my observations: P0 Issues (Must Fix)1. Potential data loss risk in The Consider adding a transaction-like approach or making the inspection phase complete before any deletions occur. 2. Division by zero in if interval != 0 && manifest.version % interval != 0 {
return Ok(None);
}The check P1 Issues (Should Fix)3. Recursive cleanup without depth limit The
Consider adding a maximum depth parameter or iterative approach. 4. Holding mutex lock while doing async I/O in let mut inspection = inspection.lock().unwrap();The lock is held while processing potentially slow I/O operations (reading manifests). This could block other concurrent operations. Consider reducing the critical section to only the data structure modifications. 5. Missing test for The test Minor Observations
Questions
Overall, this is a significant feature addition. The main concerns are around the cascading cleanup behavior and ensuring data safety when errors occur mid-cleanup. Please address the P0 issues before merging. |
PR Review: Support cleanup across branchesThis PR adds cross-branch file lifecycle management for Lance datasets. I've reviewed the changes focusing on correctness, performance, and potential issues. P0 Issues (Critical)1. Potential data loss with In policy.error_if_tagged_old_versions = false;
policy.delete_unverified = false;However, there's no protection against deleting files that are actively being written by ongoing transactions on referenced branches. The cascade cleanup could race with concurrent writes to child branches and delete files that are being referenced. Recommendation: Add documentation clearly warning users this option should only be used when no concurrent writes are happening on any referenced branches, or implement a locking mechanism. 2. Breaking change to The new pub fn null_branch_identifier() -> BranchIdentifier {
BranchIdentifier::null() // generates new Uuid::new_v4()
}This means the same branch file will deserialize to different P1 Issues (High Priority)3. if interval != 0 && manifest.version % interval != 0 {
return Ok(None);
}When if interval == 0 || manifest.version % interval != 0 {
return Ok(None);
}4. Lock contention with In 5. Java test coverage reduced The Java test (
This reduces confidence in the Java bindings' branch lifecycle management. Minor Notes
Overall the architecture is sound for managing cross-branch file references. The |
1247130 to
a23a8e8
Compare
a23a8e8 to
943b460
Compare
|
Sorry I missed this PR lately, will take another pass tomorrow |
369ef92 to
fe503ed
Compare
|
I've pushed some updates today:
Apologies for the multiple pushes, I hope it didn't interrupt any work. @jackye1995 The code is truely ready for review. I know this PR has accumulated quite a bit of context over time, please let me know if anything is unclear or if you'd like me to clarify any part of the changes |
|
I checked out our branch locally, the commit history seems a bit messed up? I see this: looks like somehow there is a squash in the middle, and some new commits after the head. Could you double check it? |
jackye1995
left a comment
There was a problem hiding this comment.
looks good to me! Just 1 nit comment
| stats_guard.old_versions += stats.old_versions; | ||
| } | ||
| } | ||
| Ok::<(), Box<dyn std::error::Error + Send + Sync>>(()) |
There was a problem hiding this comment.
this can just be Ok::<(), lance_core::Error>(()) I think?

Close #4858