# Fixing two cascading bugs when running the CK MoE tuner #2464

xaguilar-amd wants to merge 1 commit into ROCm:main
azaidy added a commit that referenced this pull request on May 4, 2026:

> @xaguilar-amd asked to drop #2464 (CK MoE tuner bug fixes) from this bulk merge — they don't need it for the uplift. Verified that #2464 is the only PR in this bulk merge touching aiter/jit/core.py and aiter/utility/mp_tuner.py: the diff between the branch and origin/main on those files is exactly #2464's +9/-1 and +5/-0, with no other PR content mixed in. Restoring both files to origin/main therefore drops #2464 cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Problem

Running the CK MoE 2-stage tuner fails immediately with two cascading errors, leaving zero shapes tuned.
## Root causes

### 1. `module_moe_asm.so` was compiled without torch linkage (`aiter/jit/core.py`)

`module_moe_asm` is used in two modes: as a pybind extension (loaded via `importlib`) and as a raw ctypes library (loaded via `ctypes.CDLL`). The ctypes path lives in `_ctypes_call._ensure_loaded()` and, before this fix, unconditionally forced `torch_exclude=True`.
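A minimal sketch of the change, assuming a simplified loader; `build_module`, the class shape, and the caching logic are illustrative stand-ins, and only `torch_exclude` and `ctypes.CDLL` come from the actual code:

```python
import ctypes


def build_module(name: str, torch_exclude: bool = False) -> str:
    """Hypothetical stand-in for aiter's JIT build step; returns the .so path."""
    raise NotImplementedError


class _CtypesCall:
    """Illustrative stand-in for the ctypes loading path in aiter/jit/core.py."""

    def __init__(self, module_name: str):
        self.module_name = module_name
        self._lib = None

    def _ensure_loaded(self) -> ctypes.CDLL:
        if self._lib is None:
            # Pre-fix: the ctypes path always overrode the configuration,
            # so module_moe_asm was built without -ltorch_cpu/-lc10:
            #   so_path = build_module(self.module_name, torch_exclude=True)
            # Post-fix: respect each module's own configured value, which
            # defaults to torch_exclude=False for module_moe_asm:
            so_path = build_module(self.module_name)
            self._lib = ctypes.CDLL(so_path)
        return self._lib
```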
Setting `torch_exclude=True` drops all torch link flags (`-ltorch_cpu`, `-lc10`, etc.) from the build. This is intentional for purely standalone C kernel modules, but `module_moe_asm`'s source list includes `csrc/pybind/moe_op_pybind.cu`, a pybind11 file that uses `at::Tensor` and other PyTorch C++ types. Those types reference the vtable symbol `_ZTVN5torch8autograd12AutogradMetaE`, which the linker left undefined.
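A quick way to confirm the symptom on a built `.so` (a sketch using `readelf -d`; the helper name and parsing are mine, not part of the PR):

```python
import subprocess


def needed_entries(so_path: str) -> list[str]:
    """Return the DT_NEEDED entries of a shared object, via `readelf -d`."""
    out = subprocess.run(
        ["readelf", "-d", so_path], capture_output=True, text=True, check=True
    ).stdout
    # Lines look like: 0x...01 (NEEDED)  Shared library: [libtorch_cpu.so]
    return [
        line.split("[", 1)[1].rstrip("]")
        for line in out.splitlines()
        if "(NEEDED)" in line
    ]


# Before the fix, "libtorch_cpu.so" is missing from this list for
# module_moe_asm.so; after rebuilding with torch_exclude=False it appears.
print(needed_entries("/path/to/module_moe_asm.so"))  # path illustrative
```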
Because `libtorch_cpu.so` was absent from `NEEDED`, `ctypes.CDLL` failed when loading the `.so` in tuner worker subprocesses. The tuner uses `mp.set_start_method("spawn")`, so each worker starts a fresh Python interpreter. Even though `import torch` runs inside the worker (via `mp_tuner.py`'s top-level import), PyTorch's own shared libraries are loaded as transitive dependencies of `torch._C.so`. On Linux, transitive dependencies do not inherit `RTLD_GLOBAL`, so their symbols are invisible to subsequent `ctypes.CDLL` calls.

**Fix:** remove the forced override so each module is built with its own configured value (which defaults to `torch_exclude=False` for `module_moe_asm`). The rebuilt `.so` lists `libtorch_cpu.so` in `NEEDED`, and the dynamic linker resolves all symbols automatically.
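For context on the loader behavior described above, here is a sketch of a common workaround (re-opening `libtorch_cpu.so` with `RTLD_GLOBAL` before the `ctypes.CDLL` call). It only illustrates the mechanism; this PR fixes the link step instead, which is more robust:

```python
import ctypes
import os

import torch

# `import torch` dlopen's torch._C, which pulls in libtorch_cpu.so as a
# transitive dependency. Transitive dependencies are loaded RTLD_LOCAL on
# Linux, so their symbols stay invisible to later ctypes.CDLL calls, and a
# .so with an undefined torch symbol fails to load with OSError.
#
# Workaround (not the fix chosen in this PR): publish torch's symbols into
# the global namespace before loading the module.
libtorch_cpu = os.path.join(os.path.dirname(torch.__file__), "lib", "libtorch_cpu.so")
ctypes.CDLL(libtorch_cpu, mode=ctypes.RTLD_GLOBAL)

lib = ctypes.CDLL("/path/to/module_moe_asm.so")  # path illustrative
```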
### 2. Failed tasks left `result_dict` empty, causing `IndexError` (`aiter/utility/mp_tuner.py`)

`mp_tuner` collects worker results into `result_dict[k]` and reconstructs a flat list at the end. When a task raises any exception that is not `MPTimeoutError` and not `KeyError`, the existing handler logged the error and marked the task as complete, but never wrote a placeholder into `result_dict[k]`. The loop then called `task_result[0]` on an empty list and raised `IndexError`, crashing the tuner before it could print which shapes failed.

**Fix:** add the same `add_dummy_result` call that already existed for the timeout branch, so every failed task gets an `(info, float("inf"), 1.0)` placeholder in `result_dict`.
## How the two bugs interact

Bug 1 caused every ASM-stage-1 task to raise `OSError` in the subprocess. Bug 2 meant that the parent process then crashed with `IndexError` when collecting results, before any summary or diagnostic output could be printed. Fixing Bug 2 alone allows the tuner to survive and produce a proper "Failed shapes" summary, but no ASM-stage-1 kernels are tuned. Both fixes are required for correct operation.
## Testing

After deleting the stale `.so` (so it is rebuilt with the correct flags on the next run) and applying both patches, the tuner completes without errors.
## Notes for reviewers

The `torch_exclude=True` override in `_ctypes_call._ensure_loaded` is safe to remove because any module that has `torch_exclude` explicitly set to `True` in `optCompilerConfig.json` (e.g. `module_aiter_enum`) will continue to use that value. Only modules that rely on the default (`False`) are affected, as the precedence sketch below illustrates.
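A sketch of that precedence (the JSON schema and the helper are assumptions; only the file name, `torch_exclude`, and the default of `False` come from the PR):

```python
import json


def torch_exclude_for(module_name: str, config_path: str = "optCompilerConfig.json") -> bool:
    """Explicit per-module setting wins; everything else falls back to False."""
    with open(config_path) as f:
        config = json.load(f)
    # e.g. module_aiter_enum sets torch_exclude explicitly and keeps behaving
    # as before; module_moe_asm relies on the default and now links torch again.
    return config.get(module_name, {}).get("torch_exclude", False)
```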
If a future module genuinely needs both ctypes loading and a torch-free build, its pybind source file should be separated from its pure-C sources into a distinct build target so the two concerns do not conflict at link time.
PR created with Cursor.