feat: support cleanup across branches#5009

Merged
majin1102 merged 5 commits intolance-format:mainfrom
majin1102:branch_clean_up
Feb 9, 2026

Conversation

@majin1102
Contributor

Close #4858

@github-actions github-actions Bot added the enhancement New feature or request label Oct 20, 2025
@majin1102 majin1102 marked this pull request as draft October 20, 2025 16:37
@codecov-commenter

codecov-commenter commented Oct 20, 2025

Codecov Report

❌ Patch coverage is 88.96157% with 135 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/dataset/cleanup.rs 90.26% 22 Missing and 80 partials ⚠️
rust/lance/src/dataset/refs.rs 81.06% 28 Missing and 4 partials ⚠️
rust/lance/src/dataset.rs 83.33% 0 Missing and 1 partial ⚠️

@jackye1995 jackye1995 self-requested a review October 23, 2025 03:46
@majin1102 majin1102 force-pushed the branch_clean_up branch 2 times, most recently from 381e49a to 79d98a9 Compare November 13, 2025 15:55
@majin1102 majin1102 marked this pull request as ready for review November 13, 2025 16:00

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread rust/lance/src/dataset/cleanup.rs
@majin1102
Contributor Author

Something went wrong after rebasing. I will fix it tomorrow.

@majin1102 majin1102 marked this pull request as draft November 13, 2025 17:08
@majin1102 majin1102 marked this pull request as ready for review November 15, 2025 05:58
@majin1102
Contributor Author

Added more tests for corner cases.
PTAL when you have time @jackye1995


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment thread rust/lance/src/dataset/cleanup.rs
@majin1102 majin1102 force-pushed the branch_clean_up branch 2 times, most recently from 0cfed2b to 58fd8fc Compare December 3, 2025 12:24
Contributor

@jackye1995 jackye1995 left a comment

Thanks for working on this! Some comments added

Comment thread rust/lance/src/dataset/cleanup.rs
Comment thread rust/lance/src/dataset/cleanup.rs Outdated
} else {
return Err(Error::Internal {
message: format!(
"Branch {} is not referenced by any version from {}",
Contributor

In that case, I think we should clean up the branch because it is orphaned/invalid, since a reference from a source branch is required. If that source version has already been deleted, this branch should be deleted as well.

Comment thread rust/lance/src/dataset/cleanup.rs Outdated
let num_old_manifests = old_manifests.len();

// Ideally this collect shouldn't be needed here but it seems necessary
// Ideally this collect shouldn't be needed here but it sseems necessary
Contributor

nit: typo sseems

pub children: BTreeSet<BranchLineage>,
}

impl BranchLineage {
Contributor

I am wondering if this over-complicates the problem. In my mind, there are 2 things we need:

  1. Get a post-order of the branches to clean up.
  2. Then, for each branch, clean up like a normal dataset, but also pass in the tagged versions and the branch head versions in this dataset.

I don't think we need a full BranchLineage concept to do this. This is an example I requested claude to write, just for reference, but I think it should work:

  use std::collections::{HashMap, HashSet};

  /// Get cleanup order (post-order) for branches descending from `root`
  fn get_descendants_post_order(
      branches: &HashMap<String, BranchContents>,
      root: Option<&str>,
  ) -> Vec<String> {
      let mut result = Vec::new();
      // Each entry is (branch, expanded). `expanded` marks a branch whose children
      // were already pushed; without it a re-pushed parent would be expanded forever.
      let mut stack = vec![(root.map(String::from), false)];

      while let Some((current, expanded)) = stack.pop() {
          if expanded {
              // All descendants have been emitted, so emit the branch itself
              if let Some(name) = current {
                  result.push(name);
              }
              continue;
          }

          // Find children of current
          let children: Vec<_> = branches.iter()
              .filter(|(_, c)| c.parent_branch == current)
              .map(|(name, _)| name.clone())
              .collect();

          // Re-push current as expanded, then push children
          stack.push((current, true));
          for child in children {
              stack.push((Some(child), false));
          }
      }
      result
  }
  /// Get versions pinned by child branches
  fn get_pinned_versions(
      branches: &HashMap<String, BranchContents>,
      branch: Option<&str>,
  ) -> HashSet<u64> {
      branches.values()
          .filter(|c| c.parent_branch.as_deref() == branch)
          .map(|c| c.parent_version)
          .collect()
  }

What do you think?
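
The two steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: `BranchInfo` is a hypothetical stand-in for `BranchContents` that keeps just the parent pointer and parent version, and `children_of`, `post_order`, `pinned_versions`, and `cleanup_plan` are invented helper names, not Lance APIs. Tagged versions and branch heads are left out for brevity.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for `BranchContents`: only the fields the sketch needs.
#[derive(Debug, Clone)]
pub struct BranchInfo {
    /// `None` means the branch was created from main.
    pub parent_branch: Option<String>,
    pub parent_version: u64,
}

/// Direct children of `parent`, sorted by name so the result is deterministic.
fn children_of(branches: &HashMap<String, BranchInfo>, parent: Option<&str>) -> Vec<String> {
    let mut names: Vec<String> = branches
        .iter()
        .filter(|(_, info)| info.parent_branch.as_deref() == parent)
        .map(|(name, _)| name.clone())
        .collect();
    names.sort();
    names
}

/// Post-order walk below `root`: every branch appears after all of its descendants.
/// Main (`None`) is the implicit root and is never part of the output.
pub fn post_order(branches: &HashMap<String, BranchInfo>, root: Option<&str>) -> Vec<String> {
    let mut out = Vec::new();
    // (branch, expanded): `expanded` means the children were already pushed.
    let mut stack: Vec<(Option<String>, bool)> = vec![(root.map(String::from), false)];
    while let Some((current, expanded)) = stack.pop() {
        if expanded {
            if let Some(name) = current {
                out.push(name);
            }
            continue;
        }
        let kids = children_of(branches, current.as_deref());
        stack.push((current, true));
        // Push in reverse so that children are visited in name order.
        for kid in kids.into_iter().rev() {
            stack.push((Some(kid), false));
        }
    }
    out
}

/// Versions of `branch` that child branches were created from; cleanup must keep them.
pub fn pinned_versions(
    branches: &HashMap<String, BranchInfo>,
    branch: Option<&str>,
) -> BTreeSet<u64> {
    branches
        .values()
        .filter(|info| info.parent_branch.as_deref() == branch)
        .map(|info| info.parent_version)
        .collect()
}

/// Cleanup plan: each branch in post-order, paired with the versions it must retain.
pub fn cleanup_plan(branches: &HashMap<String, BranchInfo>) -> Vec<(String, BTreeSet<u64>)> {
    post_order(branches, None)
        .into_iter()
        .map(|name| {
            let pinned = pinned_versions(branches, Some(name.as_str()));
            (name, pinned)
        })
        .collect()
}
```

With `b1` and `b3` created from main and `b2` created from `b1`, the plan visits `b2` before `b1`, and `b1` carries the version that `b2` was created from.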

Contributor Author

@majin1102 majin1102 Dec 8, 2025

I don't think we need a full BranchLineage concept to do this. This is an example I requested claude to write, just for reference, but I think it should work

After rethinking this, I use a global branch identifier to make things work.

  1. A branch identifier is globally unique and stays unique even after the branch is deleted, unlike a branch name. This avoids id traveling if we delete a branch and create a new one with the same name.
  2. Sorting (or reverse-sorting) the identifiers gives a pre-order (or post-order) traversal of child branches, which keeps traversal simple.
  3. Potentially we could use branch_id and version_number to globally identify a snapshot.
  4. Potentially, if we use branch_id to generate the branch path (just the last uuid, like dataset/tree/uuid), we could safely delete a branch and create a new one with the same name (this might introduce a breaking change and is not implemented yet).
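
Point 2 depends on identifiers that sort in lineage order. One way to get that property, shown here as a sketch under assumptions and not as the PR's actual `BranchIdentifier` layout, is to model an identifier as the chain of fork points leading from main to the branch. `LineageId`, `child`, `is_ancestor_of`, `pre_order`, and `post_order` are hypothetical names, and a `u32` tag stands in for a UUID.

```rust
/// Hypothetical identifier: the chain of (fork version, branch tag) pairs leading from
/// main to the branch. The real `BranchIdentifier` in the PR may be laid out differently.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct LineageId(pub Vec<(u64, u32)>);

impl LineageId {
    /// Identifier of a branch forked from `self` at `version`; `tag` stands in for a UUID.
    pub fn child(&self, version: u64, tag: u32) -> LineageId {
        let mut chain = self.0.clone();
        chain.push((version, tag));
        LineageId(chain)
    }

    /// True when `self` is a strict ancestor of `other`.
    pub fn is_ancestor_of(&self, other: &LineageId) -> bool {
        self.0.len() < other.0.len() && other.0.starts_with(&self.0)
    }
}

/// Parents first: a chain always sorts before any longer chain that it is a prefix of.
pub fn pre_order(mut ids: Vec<LineageId>) -> Vec<LineageId> {
    ids.sort();
    ids
}

/// Children first: the reverse of the ordering above.
pub fn post_order(mut ids: Vec<LineageId>) -> Vec<LineageId> {
    ids.sort_by(|a, b| b.cmp(a));
    ids
}
```

Because the derived ordering compares the chains lexicographically, no explicit tree structure is needed to visit descendants before (or after) their ancestors.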

/// - `children`: ordered set for deterministic traversal.
#[derive(Debug)]
pub struct BranchLineage {
pub deleted: bool,
Contributor

Should we really support this? I think that because we track the branch root instead of the branch head, we should enforce that a child branch is deleted before its parent branch. I understand we want the experience to be similar to git, but git tracks branch by head not by root, and there is no data dependency between child and parent, so it is much more feasible in git than in our case. Even if we want to support this, I feel it should be a much later iteration of the feature, after we get the basic cases working.

Contributor Author

@majin1102 majin1102 Dec 8, 2025

I agree that we should at least enforce this in delete_branch. My thoughts are below.

I understand we want the experience to be similar to git, but git tracks branch by head not by root, and there is no data dependency between child and parent

This is a good high-level point. Let me elaborate on it.

For Git:

  • Git uses a head to track a branch.
  • Implicitly, the branch lineage relies on the parent pointer in each commit.
  • Commit metadata is not deleted (as far as I know), so the branch lineage is not lost when we delete a branch.

Iceberg uses a similar mechanism to manage branches. Say we have main, branch1 created from main, and branch2 created from branch1. When we delete branch1, the snapshots in branch1 that are not inherited by branch2 become unreferenced and are cleaned up by ExpireSnapshots. In other words, Iceberg supports deleting a middle branch and handles the cleanup correctly.

For Lance:

  • Lance uses branch_name:version_number to track a branch.
  • Implicitly, the branch lineage relies on the branch root mapping and on version numbers being sequential. This is what leads to the Branch Identifier.
  • Deleting a branch deletes both the branch data and its metadata.

I think a reasonable way to delete a branch, so that deletion works and the data stays available, is:

  1. Delete the BranchContents file first (currently this is the last operation).
  2. Prepare a dedicated CleanupPolicy for the deleted branch's dataset. The policy retains only the referenced files.
  3. Run the cleanup.
  4. If the cleanup fails, it can be re-run by a parent branch cleanup with clean_referenced_branch set to true.

For this PR, I think it's fine to just throw an error when deleting a branch that has children. I think the deletion procedure above could be addressed after (or if) we use a UUID to generate the branch path, which would avoid the id traveling and name-occupying issues. (Honestly, looking back, I would not have used the branch name to build the path, although it seems friendlier to OSS users and lets us reuse the dataset commit to provide atomicity.) On the other hand, introducing BranchIdentifier could be a good opportunity to add backward and forward compatibility, if we decide to go this route. I'd love to hear your thoughts on this @jackye1995

This would introduce a format change, so we should remain cautious. I'll organize my thoughts if necessary, but overall, after this PR, the feature has reached a production-ready state, except for some limitations around branch deletion.

@majin1102
Contributor Author

I replied to the discussions above. Please take a look at the newest code.

@github-actions
Contributor

Code Review Summary

This PR implements branch-aware cleanup for Lance datasets, adding the ability to properly track file references across branches and prevent deletion of files still in use by child branches.

P0/P1 Issues

1. Breaking Change in Branch Deletion (P0)

The delete_branch API signature changed from delete(branch) to delete(branch, force), but more importantly, the semantics changed significantly: branches with child branches can no longer be deleted. The Java test was modified to delete branch2 instead of branch1 because branch1 now has a child (branch2) and cannot be deleted.

This is a breaking behavior change that should be clearly documented. Users who relied on being able to delete parent branches will now get errors. Consider:

  • Adding migration documentation
  • Providing a way to force-delete branches with descendants (cascading delete or explicit orphaning)

2. Potential Data Race in process_branch_referenced_manifests (P1)

In cleanup.rs:702-740, the Mutex<CleanupInspection> is held while iterating over fragments and indexes. While individual locks are scoped correctly, the is_referenced flag is set inside the lock but the final old_manifests.retain() only happens if is_referenced is true. If an error occurs between setting files as referenced and the retain call, the cleanup inspection could be in an inconsistent state.

3. Backwards Compatibility for BranchIdentifier (P1)

The null_branch_identifier() default function creates a random UUID each time it's called (refs.rs:758-761):

pub fn null_branch_identifier() -> BranchIdentifier {
    BranchIdentifier::null()
}

This means legacy branches without an identifier field will get different random UUIDs on each deserialization. This could cause:

  • Incorrect parent-child relationship detection
  • Non-deterministic behavior in tests

Consider using a deterministic sentinel UUID (e.g., all zeros) for null identifiers instead of random UUIDs.
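
A minimal sketch of the deterministic-sentinel idea, using only the standard library. `BranchId` is a hypothetical 16-byte stand-in for the PR's `BranchIdentifier`, whose real fields are not shown in this thread; only the function name `null_branch_identifier` comes from the snippet above.

```rust
/// Hypothetical 128-bit branch identifier stored as raw bytes (a stand-in for a UUID).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct BranchId(pub [u8; 16]);

impl BranchId {
    /// Sentinel for legacy branch files that were written without an identifier.
    pub const NULL: BranchId = BranchId([0u8; 16]);

    /// True when this is the sentinel rather than a real identifier.
    pub fn is_null(&self) -> bool {
        *self == Self::NULL
    }
}

/// Default used when the identifier field is missing: always the same value,
/// so repeated reads of the same legacy file agree with each other.
pub fn null_branch_identifier() -> BranchId {
    BranchId::NULL
}
```

One consequence of this design is that every legacy branch shares the sentinel, so code that compares identifiers would need to treat the null value as "unknown" rather than as a match between two branches.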

Minor Notes

  • Good test coverage with comprehensive lineage tests
  • The interval != 0 check added to prevent division by zero is a nice fix
  • Clippy improvement: &Vec<IndexMetadata> → &[IndexMetadata]

Reviewed with Claude Code

@github-actions
Contributor

PR Review: feat: support cleanup across branches

Thanks for working on this feature to support file lifetime management across branches. This is a complex change that adds significant new functionality. Here are my observations:

P0 Issues (Must Fix)

1. Potential data loss risk in retain_branch_lineage_files

The retain_branch_lineage_files function iterates through referenced branches and moves files from verified_files to referenced_files. However, there's a concern: if a manifest read fails in process_branch_referenced_manifests, the error propagates up and could cause the cleanup to fail after already having cleaned some branches (via clean_referenced_branches). This could leave the dataset in an inconsistent state.

Consider adding a transaction-like approach or making the inspection phase complete before any deletions occur.
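
The "inspect everything first, then delete" suggestion can be sketched as two phases. This is an assumption-heavy illustration, not the structure of cleanup.rs: `Inspection`, `plan_deletions`, and `execute_deletions` are hypothetical names, and errors are plain strings to keep the sketch self-contained.

```rust
use std::collections::BTreeSet;

/// Hypothetical result of inspecting one branch: the files that look unreferenced,
/// or an error such as a failed manifest read.
pub type Inspection = Result<Vec<String>, String>;

/// Phase 1: merge all inspections. Any failure aborts before a single file is deleted.
pub fn plan_deletions(inspections: Vec<Inspection>) -> Result<BTreeSet<String>, String> {
    let mut to_delete = BTreeSet::new();
    for inspection in inspections {
        to_delete.extend(inspection?);
    }
    Ok(to_delete)
}

/// Phase 2: runs only after phase 1 succeeded. `delete` is a caller-supplied callback.
/// Returns the number of paths handed to the callback.
pub fn execute_deletions<F>(plan: &BTreeSet<String>, mut delete: F) -> usize
where
    F: FnMut(&str),
{
    for path in plan {
        delete(path);
    }
    plan.len()
}
```

Because `plan_deletions` returns early on the first error, a failed manifest read in any branch leaves every file in place.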

2. Division by zero in build_cleanup_policy

if interval != 0 && manifest.version % interval != 0 {
    return Ok(None);
}

The check interval != 0 was added, which is good. However, when interval == 0 the condition short-circuits to false, so the early return of Ok(None) is skipped and cleanup proceeds on every commit. That might be counterintuitive: users setting interval=0 might expect auto cleanup to be disabled. Consider documenting this behavior or treating interval=0 as an error.

P1 Issues (Should Fix)

3. Recursive cleanup without depth limit

The clean_referenced_branches function can trigger cascading cleanups across branches. If branches form a deep chain (A → B → C → D...), this could lead to:

  • Stack overflow in deeply nested scenarios
  • Unbounded execution time
  • Holding locks/resources for extended periods

Consider adding a maximum depth parameter or iterative approach.

4. Holding mutex lock while doing async I/O in process_branch_referenced_manifests

let mut inspection = inspection.lock().unwrap();

The lock is held while processing potentially slow I/O operations (reading manifests). This could block other concurrent operations. Consider reducing the critical section to only the data structure modifications.

5. Missing test for clean_referenced_branches error handling

The test test_branch_cleanup_with_descendants is comprehensive but doesn't test the error path when clean_referenced_branches fails mid-way through cleaning multiple branches.

Minor Observations

  • The old_manifests changed from Vec<Path> to HashMap<Path, Manifest>, which increases memory usage as manifests are now retained. This may be necessary for the new functionality but worth noting for large datasets with many old versions.

  • The BranchIdentifier struct and collect_referenced_versions method in refs.rs (inferred from the diff) should have documentation explaining the branch lineage traversal algorithm.

  • The Java CleanupPolicy.Builder documentation says "clean referenced branches before clean the current branch" - consider rewording for clarity: "If true, automatically clean referenced branches before cleaning the current branch."

Questions

  1. What's the expected behavior when a referenced branch has already been deleted or is inaccessible?
  2. Should there be a dry-run mode to preview which branches would be affected before running cleanup?

Overall, this is a significant feature addition. The main concerns are around the cascading cleanup behavior and ensuring data safety when errors occur mid-cleanup. Please address the P0 issues before merging.

@github-actions
Contributor

PR Review: Support cleanup across branches

This PR adds cross-branch file lifecycle management for Lance datasets. I've reviewed the changes focusing on correctness, performance, and potential issues.

P0 Issues (Critical)

1. Potential data loss with clean_referenced_branches enabled

In cleanup.rs:clean_referenced_branches(), when cleaning referenced branches, the policy is modified:

policy.error_if_tagged_old_versions = false;
policy.delete_unverified = false;

However, there's no protection against deleting files that are actively being written by ongoing transactions on referenced branches. The cascade cleanup could race with concurrent writes to child branches and delete files that are being referenced.

Recommendation: Add documentation clearly warning users this option should only be used when no concurrent writes are happening on any referenced branches, or implement a locking mechanism.

2. Breaking change to BranchContents serialization

The new identifier field with #[serde(default = "null_branch_identifier")] generates a random UUID on deserialization of old branch contents:

pub fn null_branch_identifier() -> BranchIdentifier {
    BranchIdentifier::null()  // generates new Uuid::new_v4()
}

This means the same branch file will deserialize to different identifier values on each read, which could cause inconsistent behavior when comparing identifiers across different read operations. The identifier should be deterministic based on the branch name or stored data.

P1 Issues (High Priority)

3. interval=0 runs cleanup on every commit in build_cleanup_policy

if interval != 0 && manifest.version % interval != 0 {
    return Ok(None);
}

When interval=0, the check short-circuits to false, the early return is skipped, and the cleanup proceeds. However, setting interval=0 likely indicates "never auto-cleanup". Consider treating interval=0 as "disabled":

if interval == 0 || manifest.version % interval != 0 {
    return Ok(None);
}
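
The two readings of interval=0 discussed above can be made explicit in a small predicate. This sketch assumes the "0 means disabled" interpretation; `should_auto_cleanup` is a hypothetical helper, and which semantics Lance should adopt is the maintainers' call.

```rust
/// Returns true when auto cleanup should run for `version`.
/// Assumption: an interval of 0 means "auto cleanup disabled".
pub fn should_auto_cleanup(version: u64, interval: u64) -> bool {
    // Test for zero first: `version % 0` would panic at runtime, and `&&`
    // short-circuits so the remainder is never evaluated in that case.
    interval != 0 && version % interval == 0
}
```

Under this reading, `should_auto_cleanup(10, 0)` is false, whereas the `interval != 0 && version % interval != 0` early-return shown above lets cleanup run when the interval is 0.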

4. Lock contention with Mutex<CleanupInspection>

In retain_branch_lineage_files(), a single Mutex<CleanupInspection> is held across all concurrent manifest processing. Since process_branch_referenced_manifests() does I/O while holding the lock (via lock().unwrap() after async operations), this serializes what should be parallel work. Consider collecting results separately and merging at the end.
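
A minimal sketch of the "collect separately, merge at the end" idea. It uses scoped OS threads from the standard library as a stand-in for the async tasks in the real code, and `read_referenced_files` is a hypothetical placeholder for reading a manifest; neither reflects the actual cleanup.rs implementation.

```rust
use std::collections::HashSet;
use std::sync::Mutex;
use std::thread;

/// Hypothetical stand-in for reading one manifest and listing the files it references.
/// Here a "manifest" is just a comma-separated string.
fn read_referenced_files(manifest: &str) -> HashSet<String> {
    manifest
        .split(',')
        .filter(|f| !f.is_empty())
        .map(String::from)
        .collect()
}

/// Each worker does its slow work without the lock and takes the mutex only to merge.
pub fn collect_referenced(manifests: &[&str]) -> HashSet<String> {
    let referenced = Mutex::new(HashSet::new());
    thread::scope(|scope| {
        for manifest in manifests {
            let referenced = &referenced;
            scope.spawn(move || {
                // Slow part (I/O in the real code): no lock is held here.
                let local = read_referenced_files(manifest);
                // Short critical section: only the in-memory merge.
                referenced.lock().unwrap().extend(local);
            });
        }
    });
    referenced.into_inner().unwrap()
}
```

The lock is held only for the duration of `extend`, so workers can read their manifests in parallel instead of queuing behind one another.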

5. Java test coverage reduced

The Java test (DatasetTest.java) significantly reduces test coverage by removing the checkout verification after branch deletion:

  • Old test: verified branch2 remained after deleting branch1, then tested checkout
  • New test: only verifies list size and deletes branch2 first

This reduces confidence in the Java bindings' branch lifecycle management.

Minor Notes

  • The old_manifests type change from Vec<Path> to HashMap<Path, Manifest> increases memory usage proportionally to the number of old manifests. For datasets with many versions, this could be significant.

  • The MockObjectStore struct in tests is made pub but appears to only be used in tests - consider pub(crate) or keeping it private with #[cfg(test)].

Overall the architecture is sound for managing cross-branch file references. The BranchIdentifier design using version chains is clever. Please address the P0 issues before merging, particularly the non-deterministic identifier problem which could cause subtle bugs.

@jackye1995
Contributor

Sorry I missed this PR lately, will take another pass tomorrow

@majin1102
Contributor Author

majin1102 commented Feb 7, 2026

I've pushed some updates today:

  • Parallelized the branch cleaning process for better performance.
  • Polished comments and naming for improved readability.

Apologies for the multiple pushes, I hope it didn't interrupt any work. @jackye1995

The code is truly ready for review. I know this PR has accumulated quite a bit of context over time, so please let me know if anything is unclear or if you'd like me to clarify any part of the changes.

@jackye1995
Contributor

I checked out your branch locally, and the commit history seems a bit messed up?

I see this:

0d49b66e2 (HEAD -> branch_clean_up) fix clippy after rebasing
f16c54510 Merge branch 'main' into branch_clean_up
772c6d87a ci: run workflows also on release branch (#5398)
052e01885 refactor: split dataset tests in a tests mod (#5387)
14087e7fd feat(blob_v2): add external blob support (#5385)
58fd8fcff squash
c71607eb8 chore: remove lancedb in github discussion links and java pom file (#5394)
d9dcd8a5c fix: respect index metric when user overrides (#5395)
926156226 fix: don't allow change blob version during update (#5386)
72075b84b refactor: rename RowIdTreeMap to RowAddrTreeMap (#5266)
6c50d977b chore: bump main to 1.1.0-beta.0
88e44b1ab refactor!: deprecate mac x86 support (#5391)
6381f7c65 fix: add graceful shutdown and start for rest namespace adapter (#5325)
e7379a0fa chore: polish agents.md for better behavior (#5383)
6e458989a perf: avoid allocating filtered nodes on HNSW search path (#5377)
b7024fa37 chore: update lock file for python binding (#5376)
c3600d489 feat(java): support writing schema metadata through java LanceFileWriter API (#5310)
ff89675eb (tag: v1.0.0-beta.16) chore: release beta version 1.0.0-beta.16

looks like somehow there is a squash in the middle, and some new commits after the head. Could you double check it?

@majin1102
Contributor Author

majin1102 commented Feb 9, 2026

Sorry about the confusion. After that squash commit, I performed a lengthy rebase, and the conflicts had already been resolved beforehand.
To avoid inconsistencies, I recommend either:

  1. Deleting your local branch and checking out again from the updated remote, or
  2. Resetting your HEAD to the commit just before the squash, then pulling the latest changes (I'm not sure whether there are any stale commits before that squash commit).

The issue you're seeing is likely because your local branch was based on the old version of the squash commit. If you do a fresh pull from a clean state, you should see the rebased history (note: the squash commit has a different revision from the one in your local branch).

Contributor

@jackye1995 jackye1995 left a comment

looks good to me! Just 1 nit comment

Comment thread rust/lance/src/dataset/cleanup.rs Outdated
stats_guard.old_versions += stats.old_versions;
}
}
Ok::<(), Box<dyn std::error::Error + Send + Sync>>(())
Contributor

this can just be Ok::<(), lance_core::Error>(()) I think?
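
For readers unfamiliar with the `Ok::<(), E>(())` form: it names the error type of a closure or async block that uses `?` and has no other return annotation. The sketch below shows the same pattern with only the standard library; `ParseIntError` stands in for `lance_core::Error`, and `all_numeric` is a hypothetical example function.

```rust
use std::num::ParseIntError;

/// Returns true when every comma-separated field parses as an unsigned integer.
pub fn all_numeric(input: &str) -> bool {
    let run = || {
        for part in input.split(',') {
            // `?` returns early from the closure on the first parse failure.
            part.trim().parse::<u64>()?;
        }
        // The turbofish names the error type, which nothing else in this closure
        // determines.
        Ok::<(), ParseIntError>(())
    };
    run().is_ok()
}
```

Naming the concrete error type, rather than `Box<dyn std::error::Error + Send + Sync>`, avoids boxing and keeps the error's type visible to callers.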

@majin1102 majin1102 merged commit b360da1 into lance-format:main Feb 9, 2026
28 checks passed

Labels

enhancement New feature or request java python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support file lifetime management across branches

3 participants