
Fixing two cascading bugs when running the CK MoE tuner #2464

Open
xaguilar-amd wants to merge 1 commit into ROCm:main from xaguilar-amd:xaguilar-amd/fix-ck-moe_tuning

Conversation

@xaguilar-amd
Contributor

xaguilar-amd commented Mar 25, 2026

Problem

Running the CK MoE 2-stage tuner fails immediately with two cascading errors,
leaving zero shapes tuned:

[aiter] [Failed] Task 4 failed with OSError:
  aiter/jit/module_moe_asm.so: undefined symbol: _ZTVN5torch8autograd12AutogradMetaE

[aiter] error in batch 1 of 1: list index out of range
Traceback (most recent call last):
  File "aiter/utility/base_tuner.py", line 468, in run
    all_results = self.tune(batch, self.tunedf, args)
  File "csrc/ck_gemm_moe_2stages_codegen/gemm_moe_tune.py", line 2311, in tune
    rets = mp_tuner(...)
  File "aiter/utility/mp_tuner.py", line 535, in mp_tuner
    result.append(task_result[0])
IndexError: list index out of range

Root causes

1 — module_moe_asm.so was compiled without torch linkage (aiter/jit/core.py)

module_moe_asm is used in two modes: as a pybind extension (loaded via
importlib) and as a raw ctypes library (loaded via ctypes.CDLL).
The ctypes path lives in _ctypes_call._ensure_loaded() and, before this fix,
unconditionally forced:

d_args["torch_exclude"] = True

torch_exclude=True drops all torch link flags (-ltorch_cpu, -lc10, etc.)
from the build. This is intentional for purely standalone C kernel modules —
but module_moe_asm's source list includes csrc/pybind/moe_op_pybind.cu, a
pybind11 file that uses at::Tensor and other PyTorch C++ types. Those types
reference the vtable symbol _ZTVN5torch8autograd12AutogradMetaE, which the
linker left undefined:

$ nm -D aiter/jit/module_moe_asm.so | grep AutogradMeta
                 U _ZTVN5torch8autograd12AutogradMetaE

Because libtorch_cpu.so was absent from NEEDED, ctypes.CDLL failed when
loading the .so in tuner worker subprocesses. The tuner uses
mp.set_start_method("spawn"), so each worker starts a fresh Python
interpreter. Even though import torch runs inside the worker (via
mp_tuner.py's top-level import), PyTorch's own shared libraries are loaded
as transitive dependencies of torch._C.so. On Linux, transitive
dependencies do not inherit RTLD_GLOBAL, so their symbols are invisible to
subsequent ctypes.CDLL calls.
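
To make the loader behavior concrete, here is a minimal sketch of the failure and of the generic RTLD_GLOBAL workaround that the NEEDED-based fix below makes unnecessary (the path to libtorch_cpu.so is an assumption about the wheel layout; this is illustrative, not code from the PR):

import ctypes
import os

import torch  # libtorch_cpu.so is loaded only as a transitive, RTLD_LOCAL dependency

# Reproduces the failure: without libtorch_cpu.so in NEEDED, this raises
# OSError: undefined symbol: _ZTVN5torch8autograd12AutogradMetaE
#   ctypes.CDLL("aiter/jit/module_moe_asm.so")

# Generic workaround (not the fix this PR takes): re-open libtorch_cpu.so
# with RTLD_GLOBAL so its symbols become visible to later CDLL calls.
libtorch_cpu = os.path.join(os.path.dirname(torch.__file__), "lib", "libtorch_cpu.so")
ctypes.CDLL(libtorch_cpu, mode=ctypes.RTLD_GLOBAL)
lib = ctypes.CDLL("aiter/jit/module_moe_asm.so")  # now resolves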

Fix: remove the forced override so each module is built with its own
configured value (which defaults to torch_exclude=False for
module_moe_asm). The rebuilt .so lists libtorch_cpu.so in NEEDED, and
the dynamic linker resolves all symbols automatically.
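
The relinked binary can be verified the same way the missing dependency was diagnosed (expected output sketched; only the relevant NEEDED entry is shown, and the hex tag may vary by toolchain):

$ readelf -d aiter/jit/module_moe_asm.so | grep NEEDED
 0x0000000000000001 (NEEDED)             Shared library: [libtorch_cpu.so]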

2 — Failed tasks left result_dict empty, causing IndexError (aiter/utility/mp_tuner.py)

mp_tuner collects worker results into result_dict[k] and reconstructs a
flat list at the end:

for k in range(len(rets)):
    task_result = result_dict.get(k, [])
    ...
    result.append(task_result[0])   # IndexError if task_result is []

When a task raises any exception that is not MPTimeoutError and not
KeyError, the existing handler logged the error and marked the task as
complete — but never wrote a placeholder into result_dict[k]. The loop
then called task_result[0] on an empty list and raised IndexError, crashing
the tuner before it could print which shapes failed.

Fix: add the same add_dummy_result call that already existed for the
timeout branch so every failed task gets a (info, float("inf"), 1.0)
placeholder in result_dict.
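
A self-contained sketch of the bug and the fix (result_dict, add_dummy_result, and the (info, float("inf"), 1.0) placeholder follow the description above; the surrounding scaffolding is invented for the demonstration and is not copied from mp_tuner.py):

def add_dummy_result(result_dict, k, info):
    # Placeholder matching the timeout branch: infinite latency, 100% error.
    result_dict.setdefault(k, []).append((info, float("inf"), 1.0))

def collect(rets, result_dict):
    result = []
    for k in range(len(rets)):
        task_result = result_dict.get(k, [])
        result.append(task_result[0])  # IndexError when task_result is []
    return result

result_dict = {}
rets = ["stage1_shape0"]  # one task whose worker raised OSError (Bug 1)
try:
    raise OSError("module_moe_asm.so: undefined symbol: ...")
except Exception:
    # Before the fix: the handler only logged and marked the task complete,
    # so result_dict stayed empty and collect() crashed with IndexError.
    add_dummy_result(result_dict, 0, rets[0])  # the fix
print(collect(rets, result_dict))  # [('stage1_shape0', inf, 1.0)]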


How the two bugs interact

Bug 1 caused every ASM-stage-1 task to raise OSError in the subprocess.
Bug 2 meant that the parent process then crashed with IndexError when
collecting results — before any summary or diagnostic output could be printed.

Fixing Bug 2 alone allows the tuner to survive and produce a proper
"Failed shapes" summary, but no ASM-stage-1 kernels are tuned. Both fixes are
required for correct operation.


Testing

After deleting the stale .so (so it is rebuilt with correct flags on the next
run) and applying both patches, the tuner completes without errors.


Notes for reviewers

  • The torch_exclude=True override in _ctypes_call._ensure_loaded is safe to
    remove because any module that has torch_exclude explicitly set to True in
    optCompilerConfig.json (e.g. module_aiter_enum) will continue to use that
    value. Only modules that rely on the default (False) are affected.

  • If a future module genuinely needs both ctypes loading and a torch-free
    build, its pybind source file should be separated from its pure-C sources into
    a distinct build target so the two concerns do not conflict at link time.
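
    Purely as an illustration of that split (the real optCompilerConfig.json schema may differ, and the module and file names here are invented):

    configs = {
        "module_foo_kernels": {        # pure C kernels, loaded via ctypes.CDLL
            "srcs": ["csrc/foo_kernels.cu"],
            "torch_exclude": True,     # safe: no at::Tensor anywhere
        },
        "module_foo": {                # pybind11 wrappers, loaded via importlib
            "srcs": ["csrc/pybind/foo_op_pybind.cu"],
            "torch_exclude": False,    # needs -ltorch_cpu, -lc10, etc.
        },
    }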

PR created with Cursor.

@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label          Tests
ci:triton-355  Run Triton tests on MI355 in addition to MI325
ci:sglang      SGLang integration tests
ci:atom        ATOM benchmark (DeepSeek-R1 + GPT-OSS)
ci:vllm        vLLM benchmark
ci:all         All of the above

Add labels via the sidebar or gh pr edit 2464 --add-label <label>

xaguilar-amd marked this pull request as ready for review March 25, 2026 13:43
xaguilar-amd requested a review from a team March 25, 2026 13:43
valarLip requested a review from amd-ruitang3 March 25, 2026 14:06
azaidy added a commit that referenced this pull request May 4, 2026
@xaguilar-amd asked to drop #2464 (CK MoE tuner bug fixes) from this
bulk merge — they don't need it for the uplift.

Verified that #2464 is the only PR in this bulk merge touching
aiter/jit/core.py and aiter/utility/mp_tuner.py: the diff between the
branch and origin/main on those files is exactly #2464's +9/-1 and
+5/-0, with no other PR content mixed in. Restoring both files to
origin/main therefore drops #2464 cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>