chore: speedup modular converter (~30%) #45046

Merged

tarekziade merged 7 commits into main from tarekziade-modular-speedup on Apr 27, 2026

Conversation

@tarekziade (Collaborator) commented Mar 27, 2026

What does this PR do?

First optimization

Adding a module-source cache in utils/modular_model_converter.py. When the converter needs the same imported modeling file more than once, it now reuses the already-read source and parsed LibCST tree, invalidated by the file’s mtime, instead of reopening and reparsing the file.
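
A minimal sketch of the idea (the helper name get_module_source is illustrative, not the PR's exact code; the cached triple matches the (mtime_ns, source, tree) layout visible in the review snippet further down):

  import os

  import libcst as cst

  # Per-process cache: path -> (st_mtime_ns, source text, parsed LibCST tree)
  _MODULE_SOURCE_CACHE: dict[str, tuple[int, str, cst.Module]] = {}

  def get_module_source(file_path: str) -> tuple[str, cst.Module]:
      mtime_ns = os.stat(file_path).st_mtime_ns  # invalidation key
      cached = _MODULE_SOURCE_CACHE.get(file_path)
      if cached is not None and cached[0] == mtime_ns:
          return cached[1], cached[2]  # reuse already-read source and parsed tree
      with open(file_path, "r", encoding="utf-8") as f:
          source = f.read()
      tree = cst.parse_module(source)
      _MODULE_SOURCE_CACHE[file_path] = (mtime_ns, source, tree)
      return source, tree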

Second optimization

Removing the metadata-heavy visit path and making the lightweight traversal the default.

Instead of asking LibCST metadata whether a node is module-level and what its source position is, we track suite depth and node order directly during traversal, which avoids the MetadataWrapper overhead.
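
A minimal sketch of that traversal (illustrative, built on LibCST's plain CSTVisitor; TopLevelTracker is a made-up name): the visitor counts IndentedBlock entries and exits, so depth == 0 means module level, and a running counter records source order without any metadata resolution:

  import libcst as cst

  class TopLevelTracker(cst.CSTVisitor):
      def __init__(self):
          self.depth = 0       # how many indented suites we are inside
          self.order = 0       # running index of module-level statements
          self.top_level = []  # (order, node) pairs, in source order

      def on_visit(self, node):
          if isinstance(node, cst.IndentedBlock):
              self.depth += 1  # entering a suite: no longer at module level
          elif self.depth == 0 and isinstance(
              node, (cst.SimpleStatementLine, cst.BaseCompoundStatement)
          ):
              self.top_level.append((self.order, node))
              self.order += 1
          return True

      def on_leave(self, original_node):
          if isinstance(original_node, cst.IndentedBlock):
              self.depth -= 1

  tracker = TopLevelTracker()
  cst.parse_module("x = 1\ndef f():\n    y = 2\n").visit(tracker)
  # tracker.top_level now holds the two module-level statements, in order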

Together, these changes cut runtime by about 30% on the sampled batch (3 runs each on qwen3, olmo2, and starcoder2).

@tarekziade tarekziade self-assigned this Mar 27, 2026
@tarekziade tarekziade marked this pull request as draft March 27, 2026 08:32
@tarekziade (Collaborator, Author) commented:

  python utils/profile_modular_conversion.py \
    src/transformers/models/conditional_detr/modular_conditional_detr.py \
    --repeat 1 \
    --compare-old-wrapper \
    --verify-equal \
    --snakeviz \
    --output-dir /tmp/codex-profiles/conditional_detr_modular_profile

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@tarekziade (Collaborator, Author) commented:

So far on a 7-file sample:

before: 13.863s
after: 7.032s

49.3% faster.

@Cyrilvallez (Member) commented Apr 23, 2026

Hey! Very nice to start exploring conversion speed!
I did not profile myself or run your profiler scripts, so I would just like to confirm what the worst points to tackle first are. I see some very nice and easy wins here, but others bring harder-to-follow code that may not prove so maintainable.

The main points I see from the PR, and the tradeoffs:

  • initial caching of the modules for reuse -> nice and easy in theory, but we start many processes to run conversion in parallel, so the different processes will NOT share their cache -> did you check that it actually makes a difference? I.e. once a process in the pool starts, does it keep its cache when moving on to another input? (see the sketch after this comment)
  • Always keeping track of the node depth and order in the file instead of calling self.get_metadata(cst.metadata.ParentNodeProvider, node) -> nice and easy win, should improve quite a bit
  • imports -> hard to read/understand -> does it really provide big speedups?

The most obvious win I see here is keeping track of depth/order directly instead of calling get_metadata (and thus removing the MetadataWrapper), which is quite slow as cst needs to keep track of much more stuff. Is it the biggest source of improvement?
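
(A standalone demo of the pool question above, not code from the PR: module-level state in a worker process persists across the tasks that worker handles, so a per-process cache is reused whenever a worker converts more than one file.)

  import os
  from multiprocessing import Pool

  _CACHE = {}  # module-level, hence private to each worker process

  def convert(task_id):
      reused = len(_CACHE)  # entries left behind by earlier tasks in this worker
      _CACHE[task_id] = True
      return os.getpid(), reused

  if __name__ == "__main__":
      with Pool(processes=2) as pool:
          for pid, reused in pool.map(convert, range(8)):
              print(pid, reused)  # reused > 0 means the worker kept its cache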

@tarekziade (Collaborator, Author) commented:

Of the three optimisations, here is the breakdown from the benchmark:

  • Module cache. Yes, it helps even in the worst case where each pool worker gets a single file (~3–6% from intra-file dep-graph re-parsing). Across multiple files sharing a base, the benefit grows (~9% here on a random 12-file batch). It's the smallest of the three wins, though.
  • Fast mapper visit (depth/order tracking, no ParentNodeProvider/MetadataWrapper). That's the biggest win by far, ~55% of the total speedup.
  • Fast import analysis (symtable + AST instead of CST ScopeProvider). It's ~16% of the total, ~2s on this sample; a sketch of the idea follows this list.
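
For the third bullet, a rough sketch of the symtable + AST approach (an illustration, not the PR's actual implementation): ast enumerates the names bound by import statements, symtable reports which symbols each scope references, and the difference approximates the unused imports.

  import ast
  import symtable

  def unused_import_names(source: str) -> set[str]:
      # Names bound by `import ...` / `from ... import ...` statements.
      imported = set()
      for node in ast.walk(ast.parse(source)):
          if isinstance(node, (ast.Import, ast.ImportFrom)):
              for alias in node.names:
                  imported.add((alias.asname or alias.name).split(".")[0])

      # Collect symbols that are actually referenced, across all scopes.
      referenced = set()

      def collect(table):
          for symbol in table.get_symbols():
              if symbol.is_referenced():
                  referenced.add(symbol.get_name())
          for child in table.get_children():
              collect(child)

      collect(symtable.symtable(source, "<module>", "exec"))
      return imported - referenced

  # e.g. unused_import_names("import os, sys\nprint(os.getcwd())\n") -> {'sys'}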

So yeah, keeping the fast mappers would reap most of the benefits. Agreed that the fast import analysis adds complexity.

For the module cache, it's a small win but I don't think it adds complexity. I've experimented with dumping the CST tree to disk and reloading it, but it wasn't a huge gain so I dropped the idea. A cross-process cache is also not useful since the cst work is pretty isolated per process. I also tried raising the files-per-worker ratio, but that doesn't help.

I've spotted another place to cache: cache the dependency analysis per base module (not just the parsed tree).
Will try that

@tarekziade (Collaborator, Author) commented:

I've spotted another place to cache: cache the dependency analysis per base module (not just the parsed tree).
Will try that

nope... dead end.

the current version is 44.7% faster

@Cyrilvallez (Member) left a comment:

Nice! I like this version better, we keep the code "simple" while keeping most of the benefits!

Comment on lines +68 to +71:

  mtime_ns = os.stat(file_path).st_mtime_ns
  cached = _MODULE_SOURCE_CACHE.get(cache_key)
  if ENABLE_MODULE_SOURCE_CACHE and cached is not None and cached[0] == mtime_ns:
      return cached[1], cached[2]
@Cyrilvallez (Member):

Why do we cache the time here?

@tarekziade (Collaborator, Author):

It's not the time, it's the file's st_mtime_ns, which we use to determine whether the file was modified. It's used for a small performance gain and also by the profiler.

The gains on the execution side are minimal; if we remove the profiler we can probably rip out this part as well.

@tarekziade (Collaborator, Author):

15% gain on reads

Comment thread on utils/modular_model_converter.py (Outdated)
Comment on lines +52 to +54
ENABLE_MODULE_SOURCE_CACHE = True
ENABLE_FAST_MAPPER_VISIT = True
_MODULE_SOURCE_CACHE = {}
@Cyrilvallez (Member):

Let's make it always True, no? Wdyt?

@tarekziade (Collaborator, Author):

sure, these were here to enable side-by-side comparisons with the benchmark script.

@tarekziade (Collaborator, Author) commented:

Nice! I like this version better, we keep the code "simple" while keeping most of the benefits!

ok, I'll remove the profiler script and make this a landable PR. Thanks for your reviews!

@tarekziade tarekziade force-pushed the tarekziade-modular-speedup branch from 1fabbfd to fa786d8 on April 24, 2026 07:12
@tarekziade tarekziade changed the title investigate modular conversion speedups chore: speedup modular converter (~30%) Apr 24, 2026
@tarekziade tarekziade marked this pull request as ready for review April 24, 2026 07:17
@Cyrilvallez (Member) left a comment:

Thanks a lot! Let's simply make sure we don't see any diffs anywhere by running
python utils/modular_model_converter.py all and python utils/modular_model_converter.py examples (the examples behave as tests for some edge cases) before merging, then let's go!

Comment thread on utils/modular_model_converter.py (Outdated)
Comment on lines +1219 to +1238:

  def get_needed_imports(body: dict[str, dict], all_imports: list[cst.CSTNode]) -> list[cst.CSTNode]:
      """Get all the imports needed in the `body`, from the list of `all_imports`.
      `body` is a dict with the following structure `{str: {"insert_idx": int, "node": cst.CSTNode}}`.
      """
      new_body = [k[1]["node"] for k in sorted(body.items(), key=lambda x: x[1]["insert_idx"])]
      wrapper = MetadataWrapper(cst.Module(body=all_imports + new_body), unsafe_skip_copy=True)
      scopes = set(wrapper.resolve(ScopeProvider).values())
      import_ref_count = defaultdict(lambda: 0)
      for scope in scopes:
          for assignment in scope.assignments:
              node = assignment.node
              if isinstance(assignment, cst.metadata.Assignment) and isinstance(node, (cst.Import, cst.ImportFrom)):
                  ref_count = len(assignment.references)
                  name = assignment.name
                  import_ref_count[name] = max(ref_count, import_ref_count[name])
      # Similar imports may be redefined, and only used between their 1st and 2nd definition so if we already have
      # a ref count > 0 at any point, the imports is actually used
      unused_imports = {name for name, count in import_ref_count.items() if count <= 0 or name in body}
      return _build_needed_imports(all_imports, unused_imports)

@Cyrilvallez (Member):

nit: if it's 100% the same, I would prefer to quickly revert this part as well, to simplify the history of the file!

@tarekziade tarekziade force-pushed the tarekziade-modular-speedup branch from fa786d8 to 82ba8df on April 27, 2026 06:37
@tarekziade tarekziade enabled auto-merge April 27, 2026 06:39
@tarekziade tarekziade added this pull request to the merge queue Apr 27, 2026
Merged via the queue into main with commit 99cdd43 Apr 27, 2026
19 checks passed
@tarekziade tarekziade deleted the tarekziade-modular-speedup branch April 27, 2026 06:54
ArthurZucker pushed a commit that referenced this pull request Apr 28, 2026
* investigate modular conversion speedups

* second optim / cache

* Python-only fast path based on symtable plus a lightweight AST pass

* replaced the imported-module analysis path in convert_modular_file() with a metadata-free mapper visit that tracks top-level context and source order directly, eliminating expensive LibCST ParentNodeProvider/PositionProvider passes while preserving byte-identical output

* revert some speedups

* removed the profiler and cleaned up the converter to use both changes

* make the change closer to main