Conversation

@GreenHatHG
Contributor

@GreenHatHG GreenHatHG commented Dec 25, 2025

Summary

Fixes #10585

  1. Log paths during dry-run garbage collection.
  2. Add ODB deduplication logic to prevent scanning the same object database multiple times when repo cache and local cache point to the same location.

Example output

WARNING: This will remove all cache except items used in the workspace of the current repo.
Removing /tmp/pytest-of-jooooody/pytest-15/test_gc_dry_run_report_output0/.dvc/cache/files/md5/93/3b0b3162b40298a8e961e8dd238a11.dir
Removing /tmp/pytest-of-jooooody/pytest-15/test_gc_dry_run_report_output0/.dvc/cache/files/md5/f2/7f5596d752510b7b1e97e2e1870a45
Removing /tmp/pytest-of-jooooody/pytest-15/test_gc_dry_run_report_output0/.dvc/cache/files/md5/32/fc8a4605bce98cdac4bf5e3edc882e
Removed 3 objects from repo cache.
No unused 'legacy' cache to remove.

Run this unit test to quickly check the output:

import logging
import os

from dvc.cli import main


def test_gc_dry_run_report_output(tmp_dir, dvc, caplog):
    # Garbage object 1: A standalone file
    (garbage_stage,) = tmp_dir.dvc_gen("garbage_file", "this is garbage")

    # Garbage objects 2 & 3: A directory and its inner content
    (garbage_dir_stage,) = tmp_dir.dvc_gen({"garbage_dir": {"f": "in dir"}})

    os.remove(garbage_stage.relpath)
    os.remove(garbage_dir_stage.relpath)

    with caplog.at_level(logging.INFO, logger="dvc"):
        ret = main(["gc", "-w", "--dry"])
        assert ret == 0

    print("***** captured *****")
    print(caplog.text)
    print("*" * 20)

Dependencies

Requires updated dvc-data (PR: treeverse/dvc-data#650)

@codecov

codecov bot commented Dec 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.99%. Comparing base (2431ec6) to head (c88963c).
⚠️ Report is 182 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10937      +/-   ##
==========================================
+ Coverage   90.68%   90.99%   +0.31%     
==========================================
  Files         504      505       +1     
  Lines       39795    41096    +1301     
  Branches     3141     3257     +116     
==========================================
+ Hits        36087    37395    +1308     
- Misses       3042     3063      +21     
+ Partials      666      638      -28     

☔ View full report in Codecov by Sentry.

@GreenHatHG
Contributor Author

Hi maintainers,

The CI is failing as expected: this work depends on an unmerged PR in dvc-data that introduces the is_dir_hash function.
treeverse/dvc-data#650

I'll convert this to a Draft PR until the dependency is merged.

@GreenHatHG GreenHatHG marked this pull request as draft December 25, 2025 10:33
@skshetry
Collaborator

skshetry commented Jan 3, 2026

Hi, I think you are complicating the feature and implementation a lot. I get the intent behind separating collection and removal, but at this stage it feels like too much work for limited gain.

There's also no guarantee we'll always be able to maintain that separation; for example, if we implement #829, separating collection from removal may not be feasible.

I don’t think we need tables here. The dir/file distinction and oid are internal implementation details that we don’t expose to users, and I don’t see “Modified” as particularly meaningful for a content-addressable storage. The only field that really matters to users is the path (and maybe the count of objects that will be deleted).

While size information can be useful, since we’re dealing with garbage objects it may add unnecessary overhead.

If dvc_data.hashfile.gc.gc() doesn’t currently provide paths, we could consider breaking the API to return them instead of just a file count, and then display those paths here.

Alternatively, we could just log them inside gc() as they are deleted.
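
For illustration only, here is a rough sketch of the return-paths option (this is not the actual dvc-data API; gc_with_paths and garbage_paths are hypothetical names, and only odb.fs.remove() is taken from the sketch later in this thread):

from typing import Iterable

def gc_with_paths(odb, garbage_paths: Iterable[str], dry: bool = False) -> list[str]:
    """Sketch: remove garbage objects and return the affected paths."""
    removed: list[str] = []
    for path in garbage_paths:
        if not dry:
            # assumes an fs.remove() on the ODB's filesystem, as in the sketch below
            odb.fs.remove(path)
        removed.append(path)
    return removed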

@GreenHatHG
Contributor Author


Hi! Thanks for the feedback - you're right that this is overcomplicated. I want to simplify it following your suggestions.

Quick question: when you said "log them inside gc() as they are deleted", did you mean adding logging to dvc_data.hashfile.gc.gc()?

Because if I only modify the DVC layer, I'd have to call iter_garbage() separately to get the paths, which brings back the separation issue you mentioned.

Just want to confirm before I start making changes. Thanks!

@skshetry
Collaborator

skshetry commented Jan 5, 2026

Quick question: when you said "log them inside gc() as they are deleted", did you mean adding logging to dvc_data.hashfile.gc.gc()?

yes, just adding logger.info() calls inside the gc() function if dry=True. If I am not wrong, those logs should show up in the output.

for path in (*dir_paths, *file_paths):
    if dry:
        logger.info("Removing %s", path)
    else:
        odb.fs.remove(path)

Alternatively, we can also break gc API. It's not a big deal, as we limit dvc-data to the next minor version.

"dvc-data>=3.17.0,<3.18",

@GreenHatHG
Contributor Author


Hi! I've implemented the logging approach as you suggested. The paths are now logged inside dvc_data.hashfile.gc.gc() when dry=True.

However, I noticed that the output is duplicated:

WARNING: This will remove all cache except items used in the workspace of the current repo.
Removing /tmp/pytest-of-jooooody/pytest-3/test_gc_dry_run_report_output0/.dvc/cache/files/md5/93/3b0b3162b40298a8e961e8dd238a11.dir
Removing /tmp/pytest-of-jooooody/pytest-3/test_gc_dry_run_report_output0/.dvc/cache/files/md5/f2/7f5596d752510b7b1e97e2e1870a45
Removing /tmp/pytest-of-jooooody/pytest-3/test_gc_dry_run_report_output0/.dvc/cache/files/md5/32/fc8a4605bce98cdac4bf5e3edc882e
Removed 3 objects from repo cache.
Removing /tmp/pytest-of-jooooody/pytest-3/test_gc_dry_run_report_output0/.dvc/cache/files/md5/93/3b0b3162b40298a8e961e8dd238a11.dir
Removing /tmp/pytest-of-jooooody/pytest-3/test_gc_dry_run_report_output0/.dvc/cache/files/md5/f2/7f5596d752510b7b1e97e2e1870a45
Removing /tmp/pytest-of-jooooody/pytest-3/test_gc_dry_run_report_output0/.dvc/cache/files/md5/32/fc8a4605bce98cdac4bf5e3edc882e
Removed 3 objects from local cache.
No unused 'legacy' cache to remove.

This happens because repo cache and local cache often point to the same ODB instance, so we scan and log the same directory twice.

In my previous implementation, I had a _iter_unique_odbs() helper function to deduplicate ODBs before scanning:

def _iter_unique_odbs(odbs):
    """
    The local cache and repo cache often point to the same ObjectDB instance.
    Without deduplication, we would scan the same directory twice
    """
    seen = set()
    for scheme, odb in odbs:
        if odb and odb not in seen:
            seen.add(odb)
            yield scheme, odb

So I think the fix is to deduplicate ODBs before calling ogc(). Something like:

seen_odbs = set()
for scheme, odb in self.cache.by_scheme():
    if not odb or odb in seen_odbs:
        continue
    seen_odbs.add(odb)
    num_removed = ogc(odb, used_obj_ids, jobs=jobs, dry=dry)
    # ...

Does this approach look good to you?

@skshetry
Collaborator

skshetry commented Jan 6, 2026

Hi, we can fix that issue separately by keeping a set of seen odbs (odbs are hashable based on their path) and skipping those that we have encountered before.

https://github.com/treeverse/dvc-objects/blob/f0f73bb2f7ee0e8d08c3cb0213d08086acdf2e01/src/dvc_objects/db.py#L55
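
To illustrate why a plain set works here, a rough sketch (not from this PR), assuming ObjectDB(fs, path) construction and the path-based equality/hash referenced above:

from dvc_objects.db import ObjectDB
from dvc_objects.fs.local import LocalFileSystem

fs = LocalFileSystem()
repo_odb = ObjectDB(fs, "/tmp/project/.dvc/cache/files/md5")
local_odb = ObjectDB(fs, "/tmp/project/.dvc/cache/files/md5")

# Two instances pointing at the same cache directory compare equal,
# so a set keeps only one of them and the directory is scanned once.
assert repo_odb == local_odb
assert len({repo_odb, local_odb}) == 1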

@GreenHatHG GreenHatHG force-pushed the feat/gc-dry-detailed-output branch from 762bb32 to 5a93523 on January 7, 2026 11:21
@GreenHatHG GreenHatHG changed the title gc: implement detailed report for --dry run gc: log paths during dry-run garbage collection Jan 7, 2026
@GreenHatHG GreenHatHG marked this pull request as ready for review January 7, 2026 12:14
@skshetry skshetry merged commit ff8752c into treeverse:main Jan 7, 2026
47 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in DVC Jan 7, 2026
@skshetry
Collaborator

skshetry commented Jan 7, 2026

Thank you @GreenHatHG. 🙂


Successfully merging this pull request may close these issues.

gc --dry does not show what files are going to be removed
