Skip to content

refactoring: speedup static checks with disk cache#44992

Merged
tarekziade merged 12 commits intomainfrom
speedup-check-docstrings
Mar 31, 2026
Merged

refactoring: speedup static checks with disk cache#44992
tarekziade merged 12 commits intomainfrom
speedup-check-docstrings

Conversation

@tarekziade
Copy link
Copy Markdown
Collaborator

@tarekziade tarekziade commented Mar 25, 2026

What does this PR do?

make check-repo can be quite slow, this patch adds file-level cache to speed up checks.

We get up to a 27x speedup

  • cold cache : 46s
  • warm cache : 1.6s

@tarekziade tarekziade requested a review from ydshieh March 25, 2026 11:40
@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ydshieh
Copy link
Copy Markdown
Collaborator

ydshieh commented Mar 25, 2026

as discussed on slack, let's separate the PR into multiple ones :-)

For check_docstrings, it would be nice to add some short docstring for each newly added test in tests/repo_utils/test_check_docstrings.py, like

    def test_ignores_other_decorators(self):
        """ what this really want to check (and maybe also why it should be the expected output you specified) """
        path = self._write_temp("@dataclass\nclass Baz:\n    pass\n")
        names = _get_auto_docstring_names(path)
        self.assertEqual(names, set())

(I don't mind AI generate docstring, as long as you have an 👀 reading it first 🙏 )

For some tests, we could guess what they are from the test body, but it's not always clear.

(I know, we rarely have docstring in test methods ... in general 😢 )

@tarekziade
Copy link
Copy Markdown
Collaborator Author

splited out the doctsring refactoring in a different PR #45009

@tarekziade tarekziade force-pushed the speedup-check-docstrings branch from fee57c5 to 6ae3f7c Compare March 26, 2026 05:51
@tarekziade
Copy link
Copy Markdown
Collaborator Author

splitted out changes to check_repo in #45012

Copy link
Copy Markdown
Collaborator

@ydshieh ydshieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice and thank you for the iteration! Going to 🚀

@tarekziade tarekziade self-assigned this Mar 26, 2026
@tarekziade tarekziade force-pushed the speedup-check-docstrings branch from 103fbe0 to 0af5214 Compare March 30, 2026 10:51
@tarekziade tarekziade changed the title refactoring: speedup static checks refactoring: speedup static checks with disk cache Mar 30, 2026
@ydshieh
Copy link
Copy Markdown
Collaborator

ydshieh commented Mar 30, 2026

Thanks for the work, this is a nice speedup!

One concern: CHECKER_FILE_GLOBS in checkers.py is a manually maintained map that approximates what each checker script actually scans internally. But each checker has its own file discovery logic, and there's no guarantee these globs stay in sync with it. If a checker script starts scanning additional files and CHECKER_FILE_GLOBS isn't updated accordingly, the cache will silently say "current" and skip the checker — meaning a real issue goes undetected with no warning.

It's also not clear from the code whether these globs are a faithful 1:1 replication of each checker's internal file discovery, or just a best-effort approximation. If approximations, this fragility should at least be prominently documented.

@tarekziade
Copy link
Copy Markdown
Collaborator Author

Thanks for the work, this is a nice speedup!

One concern: CHECKER_FILE_GLOBS in checkers.py is a manually maintained map that approximates what each checker script actually scans internally. But each checker has its own file discovery logic, and there's no guarantee these globs stay in sync with it. If a checker script starts scanning additional files and CHECKER_FILE_GLOBS isn't updated accordingly, the cache will silently say "current" and skip the checker — meaning a real issue goes undetected with no warning.

It's also not clear from the code whether these globs are a faithful 1:1 replication of each checker's internal file discovery, or just a best-effort approximation. If approximations, this fragility should at least be prominently documented.

Thanks, that’s a very good point, and I agree the failure mode here is particularly risky since it can silently skip checks. I like the idea of reducing the chance of drift by moving the glob definition next to each checker, so the file discovery logic and the cache inputs live in the same place. That should make changes much harder to miss.

I’ll also double-check whether these globs are currently a strict reflection of each checker’s behavior or just approximations. If there’s any approximation involved, I’ll make that explicit in the code and add documentation calling out the risk.

Longer term, it might be worth exploring whether we can derive the inputs directly from the checker logic instead of duplicating it, to avoid this class of issue entirely.

@tarekziade tarekziade force-pushed the speedup-check-docstrings branch from 6c2aacc to ce2de86 Compare March 30, 2026 15:31
@tarekziade
Copy link
Copy Markdown
Collaborator Author

I should also add that there are no cache on CI side so we will not miss a check failure prior to merging

Copy link
Copy Markdown
Collaborator

@ydshieh ydshieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few nit, then we can go!

Super nice 🔥

Comment thread tests/repo_utils/test_checkers.py Outdated
Comment thread utils/checkers.py Outdated
Comment thread utils/checkers.py
@tarekziade tarekziade force-pushed the speedup-check-docstrings branch from 03e8290 to b88ec56 Compare March 31, 2026 12:12
@tarekziade tarekziade added this pull request to the merge queue Mar 31, 2026
Merged via the queue into main with commit 3a4a662 Mar 31, 2026
20 checks passed
@tarekziade tarekziade deleted the speedup-check-docstrings branch March 31, 2026 12:34
sirzechs66 pushed a commit to sirzechs66/transformers that referenced this pull request Mar 31, 2026
* added a cache

* use a shared cache

* added cache in repo checker

* more caching

* more caching

* display elapsed times

* added test coverage

* Remove check_repo cache changes from speedup branch

* fix bad merge

* per-checker glob

* fixed a couple of tests

* tweaks
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Apr 4, 2026
* added a cache

* use a shared cache

* added cache in repo checker

* more caching

* more caching

* display elapsed times

* added test coverage

* Remove check_repo cache changes from speedup branch

* fix bad merge

* per-checker glob

* fixed a couple of tests

* tweaks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants