
[AutoDiff] Autodiff 8: Surface LLVM adstack push/pop overflow as a Python exception #535

Merged
duburcqa merged 1 commit into main from duburcqa/split_llvm_adstack_runtime_overflow
Apr 22, 2026

Conversation

@duburcqa
Contributor

@duburcqa duburcqa commented Apr 21, 2026

Surface LLVM adstack push/pop overflow as a Python exception

Replaces the pre-existing silent-wrong-gradient-on-overflow with a catchable Python exception at the next qd.sync(). Teardown path is kept safe so the poll does not throw into ~Program().

TL;DR

qd.init(arch=qd.cpu, ad_stack_experimental_enabled=True, ad_stack_size=32)
# ... user kernel whose reverse pass pushes more than 32 entries per adstack ...
with pytest.raises((AssertionError, RuntimeError), match=r"[Aa]dstack overflow"):
    compute.grad()
    qd.sync()

# Subsequent sync is clean; the flag resets after catch.
qd.sync()

The prior behaviour on LLVM: stack_push incremented n past max_num_elements with no bounds check, so overflow produced either out-of-bounds primal slots (if the alloca was ever indexed past its capacity) or plain overwrites of the last slot (if the runtime's stack_top_primal indexing stopped pointing at valid storage). Either way: silently wrong gradients with no error.

New behaviour: stack_push bounds-checks n + 1 > max_num_elements and, on overflow, sets runtime->adstack_overflow_flag = 1 (via __atomic_store_n(..., __ATOMIC_RELAXED) for multi-threaded CPU dispatch) and skips the increment instead of wrapping or trapping. stack_pop and stack_top_primal clamp at n == 0 so post-overflow code continues to execute without reading out of bounds. The flag surfaces the failure at the next qd.sync() via a new check_adstack_overflow() poll that throws QuadrantsAssertionError.
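The clamping contract above can be sketched as a minimal Python model (hypothetical class and names; the real implementation is the C++ runtime code shown further down):

```python
class AdStackModel:
    """Toy model of the bounds-checked adstack: push sets a flag and skips
    the increment on overflow; pop and top clamp so post-overflow code
    keeps executing in-bounds."""

    def __init__(self, max_num_elements):
        self.max_num_elements = max_num_elements
        self.n = 0                      # current element count (the u64 header)
        self.overflow_flag = False      # stands in for runtime->adstack_overflow_flag

    def push(self):
        # Bounds check: on overflow, raise the flag instead of wrapping or trapping.
        if self.n + 1 > self.max_num_elements:
            self.overflow_flag = True
            return
        self.n += 1

    def pop(self):
        # Clamp at 0 so over-popping stays safe.
        if self.n > 0:
            self.n -= 1

    def top_index(self):
        # n == 0 returns slot 0 (garbage, but in-bounds).
        return self.n - 1 if self.n > 0 else 0


stack = AdStackModel(max_num_elements=32)
for _ in range(66):                     # 66 pushes against capacity 32
    stack.push()
print(stack.n, stack.overflow_flag)     # n clamps at the capacity; the flag is set
```

The key property is that overflow is recorded but never trapped on: the kernel runs to completion and the host decides later whether to raise.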

Why

Silent-wrong-gradient is the worst failure mode: no error, no crash, just a wrong answer that propagates through downstream numerical code and may take days of optimizer misbehaviour to diagnose. The fix has to be loud at sync time, but it can't throw during process teardown (where std::terminate() replaces the Python exception with an abort). Both constraints shape the design.

Mechanism

Runtime-side flag

struct LLVMRuntime {
  // ...
  i64 error_code = 0;
  // Dedicated flag for adstack-overflow-specific errors. Separate from `error_code` so assertions
  // (which set error_code=1 and are only surfaced when `compile_config.debug` is on) do not leak
  // through the always-on poll that Program::synchronize runs.
  i64 adstack_overflow_flag = 0;
  // ...
};

A second i64 on the runtime struct, intentionally disjoint from error_code:

  • error_code is tied to assert_failed and only polled when compile_config.debug is on.
  • adstack_overflow_flag is polled unconditionally at every synchronize() because silent wrong gradients are too dangerous to gate on a debug flag.
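The two-flag polling policy can be modeled in a few lines of Python (a sketch only; `RuntimeModel` and this `synchronize` are hypothetical stand-ins for the C++ host logic):

```python
class RuntimeModel:
    def __init__(self):
        self.error_code = 0             # set by assert_failed
        self.adstack_overflow_flag = 0  # set by stack_push on overflow


def synchronize(runtime, debug):
    # error_code is only surfaced in debug builds...
    if debug and runtime.error_code:
        raise AssertionError("quadrants assertion failed")
    # ...but the adstack overflow poll is unconditional.
    if runtime.adstack_overflow_flag:
        runtime.adstack_overflow_flag = 0  # reset so the next sync is clean
        raise RuntimeError("Adstack overflow; raise ad_stack_size")


rt = RuntimeModel()
rt.adstack_overflow_flag = 1
try:
    synchronize(rt, debug=False)       # raises even with debug off
except RuntimeError as e:
    print("caught:", e)
synchronize(rt, debug=False)           # clean: the flag was consumed on catch
```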

stack_push / stack_pop / stack_top_primal

void stack_push(LLVMRuntime *runtime, Ptr stack, std::size_t max_num_elements, std::size_t element_size) {
  u64 &n = *(u64 *)stack;
  if (n + 1 > max_num_elements) {
    __atomic_store_n(&runtime->adstack_overflow_flag, (i64)1, __ATOMIC_RELAXED);
    return;  // skip the increment
  }
  n += 1;
  std::memset(stack_top_primal(stack, element_size), 0, element_size * 2);
}

void stack_pop(Ptr stack) {
  auto &n = *(u64 *)stack;
  if (n > 0) { n--; }  // clamp at 0 so post-overflow code keeps running safely
}

Ptr stack_top_primal(Ptr stack, std::size_t element_size) {
  auto n = *(u64 *)stack;
  std::size_t idx = n > 0 ? n - 1 : 0;  // n == 0 returns slot 0 (garbage, but in-bounds)
  return stack + sizeof(u64) + idx * 2 * element_size;
}

The __ATOMIC_RELAXED store is required because multiple CPU worker threads may race on the same flag (thread pool dispatch over a multi-element field). Plain unsynchronized stores from multiple threads to the same object are a data race even when all writers store the same value — the compiler is free to tear the write. Relaxed atomic store compiles to a regular naturally-aligned store on x86_64 / arm64 but satisfies the C++11 memory model. The host only reads the flag after the thread pool has joined, so no ordering beyond "happens eventually" is required.
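The read-after-join pattern is easy to demonstrate in Python (a hypothetical model; Python's GIL sidesteps the C++ data-race concern, so this only illustrates the dispatch shape, not the atomics):

```python
import threading

class Runtime:
    def __init__(self):
        self.adstack_overflow_flag = 0

def worker(runtime):
    # Several workers may race on the same flag; all store the same value.
    # In the C++ runtime this store is __atomic_store_n(..., __ATOMIC_RELAXED).
    runtime.adstack_overflow_flag = 1

rt = Runtime()
threads = [threading.Thread(target=worker, args=(rt,)) for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()                      # the thread pool has joined...
print(rt.adstack_overflow_flag)  # ...so the host read needs no extra ordering
```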

Host-side poll

LlvmRuntimeExecutor::check_adstack_overflow():

  • Reads and resets the flag via runtime_retrieve_and_reset_adstack_overflow(runtime) — a one-shot JIT call that swaps the flag to 0 with __atomic_exchange_n and writes the old value into result_buffer[quadrants_result_buffer_error_id].
  • If the old value is nonzero, throws QuadrantsAssertionError with a message pointing users at default_ad_stack_size.
  • Safe to call before materialize_runtime(): no-op when llvm_runtime_ or result_buffer_cache_ is null.
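The one-shot exchange semantics, modeled in Python (hypothetical names; the real call is a JIT'd __atomic_exchange_n on the runtime struct):

```python
class Runtime:
    def __init__(self):
        self.adstack_overflow_flag = 1   # pretend a kernel overflowed

def retrieve_and_reset_adstack_overflow(runtime):
    # Atomically-in-spirit: return the old value and zero the flag in one step.
    old, runtime.adstack_overflow_flag = runtime.adstack_overflow_flag, 0
    return old

rt = Runtime()
print(retrieve_and_reset_adstack_overflow(rt))  # 1: overflow observed once
print(retrieve_and_reset_adstack_overflow(rt))  # 0: already consumed
```

The exchange is what makes the flag self-resetting: the first sync after an overflow raises, and every later sync is clean unless a new overflow occurs.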

Teardown safety

LlvmProgramImpl::synchronize() calls check_adstack_overflow() unconditionally — except during teardown, where a throw would std::terminate() out of ~Program(). The teardown guard:

void LlvmProgramImpl::pre_finalize() override { finalizing_ = true; }
void LlvmProgramImpl::finalize() override { finalizing_ = true;  /* defensive re-assignment */ }
void LlvmProgramImpl::synchronize() override {
  runtime_exec_->synchronize();
  if (!finalizing_) { runtime_exec_->check_adstack_overflow(); }
}

Program::finalize() invokes program_impl_->pre_finalize() before the two teardown syncs, so finalizing_ is already true when those syncs run. The defensive re-assignment in finalize() is belt-and-suspenders: moving the assignment into finalize() alone (dropping pre_finalize()) would silently re-introduce the std::terminate() teardown bug (the header's field-level comment spells this out).
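The guard's behaviour can be sketched as a hypothetical Python model of the C++ snippet above:

```python
class ProgramModel:
    def __init__(self):
        self.finalizing = False
        self.overflow_flag = 1           # pretend an unsynced overflow is pending

    def pre_finalize(self):
        self.finalizing = True

    def synchronize(self):
        # Poll the flag on every sync, except once finalization has begun.
        if not self.finalizing and self.overflow_flag:
            raise RuntimeError("Adstack overflow")

prog = ProgramModel()
prog.pre_finalize()
prog.synchronize()                       # no raise: the teardown sync is safe
print("teardown sync completed without raising")
```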

Codegen-side plumbing

TaskCodeGenLLVM::visit(AdStackPushStmt) now threads get_runtime() as the first arg to stack_push, matching the new signature. Everything else in the codegen is unchanged — push/pop/top still take the opaque Ptr and don't need to know about the flag.

Program-level hook

Program::finalize() invokes a new program_impl_->pre_finalize() hook before its two teardown syncs. A matching no-op virtual void pre_finalize() {} is added to ProgramImpl so non-LLVM backends are unaffected.

Internal sanity test

internal_functions.h::test_stack() is extended to exercise the new contract: push past capacity, verify n stays at max_num_elements and the flag flips; over-pop past n == 0, verify n stays at 0 and stack_top_primal clamps at slot 0; reset the flag so subsequent tests in the same fixture are not poisoned. Also plugs the pre-existing new u8[132] leak with a matching delete[] stack; before return 0;.

Tests

Five new tests in tests/python/test_adstack.py, all using a shared _overflowing_compute(n_elements, n_iter=64) helper (64 iterations + 2 setup pushes = 66 pushes, comfortably above every test's ad_stack_size=32):

  1. test_adstack_overflow_raises — overflow surfaces as (AssertionError, RuntimeError) at qd.sync(), not silently.
  2. test_adstack_overflow_flag_resets_after_catch — after the first raise, a subsequent qd.sync() with no new grad launch returns normally.
  3. test_adstack_large_capacity_resolves_overflow — raising ad_stack_size to 1024 lets the same kernel run to completion with the correct gradient (pins the remediation path the error message points users at).
  4. test_adstack_overflow_multithreaded — 16-element field so several threads overflow in parallel. Asserts the raise is still clean (no deadlock, no crash, one exception regardless of how many threads flipped the bit).
  5. test_adstack_overflow_during_teardown_does_not_abort — runs the overflowing grad in a child process and intentionally never calls qd.sync(). Child must exit with returncode 0 (the pre_finalize guard suppresses the poll during teardown), not SIGABRT.
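The child-process pattern behind test 5 looks roughly like this (a generic stand-in, not the real test — the script body here only mimics "set the flag, never sync, exit during teardown"):

```python
import subprocess
import sys

# A child process does work that would raise at the next sync, never syncs,
# and must still exit 0 because the poll is suppressed during teardown.
child_script = r"""
import sys
# Stand-in for "run the overflowing grad, never call qd.sync()": the flag is
# set but nothing polls it before interpreter teardown.
adstack_overflow_flag = 1
sys.exit(0)   # teardown must not abort
"""
proc = subprocess.run([sys.executable, "-c", child_script])
print(proc.returncode)  # 0, not a signal exit such as -SIGABRT
```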

Side-effect audit

| Concern | Where checked | Verdict |
| --- | --- | --- |
| Non-debug runs | Poll is not gated on compile_config.debug | Intentional — silent wrong gradients must not be debug-gated. |
| Multi-thread race on flag | __atomic_store_n(..., __ATOMIC_RELAXED) | Avoids a C++11 data race; compiles to a regular store on all supported arches. |
| Teardown path | pre_finalize() flips finalizing_ before the teardown syncs | Covered by test_adstack_overflow_during_teardown_does_not_abort. |
| Post-overflow continuation | pop clamps at 0, top_primal clamps at slot 0 | Host raises on the flag before any post-overflow read reaches user code. |
| Header field name typo | Program::result_buffer (no trailing underscore) | Corrected in the result_buffer_cache_ comment after claude bot flagged it. |
| Pre-existing test_stack memory leak | delete[] stack; added | Fixed opportunistically since we're rewriting the function anyway. |

Stack

Autodiff 8 of 13. Second commit of the "LLVM adstack safety" triplet split. Based on #534 (header size). Followed by #495 (codegen budget guard).


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d1a9c46601


Comment thread quadrants/runtime/llvm/runtime_module/runtime.cpp
Comment thread quadrants/runtime/llvm/runtime_module/runtime.cpp
Comment thread tests/python/test_adstack.py
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_header_size branch from 1caca8f to 4c5ee88 Compare April 21, 2026 07:19
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_runtime_overflow branch from 97533d1 to adf1dc1 Compare April 21, 2026 07:19
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_header_size branch from 4c5ee88 to 11d2006 Compare April 21, 2026 08:18
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_runtime_overflow branch from adf1dc1 to baabb85 Compare April 21, 2026 08:18
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_header_size branch from 11d2006 to 903e8e6 Compare April 21, 2026 09:50
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_runtime_overflow branch from baabb85 to 727e1a4 Compare April 21, 2026 09:51
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_header_size branch from 903e8e6 to a222483 Compare April 21, 2026 12:03
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_runtime_overflow branch from 727e1a4 to c9751cf Compare April 21, 2026 12:03
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_header_size branch from a222483 to c4665c5 Compare April 21, 2026 14:42
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_runtime_overflow branch from c9751cf to 621e760 Compare April 21, 2026 14:42
@hughperkins
Collaborator

Checklist:

  • tightening of error handling => doesn't change existing API or usage
    • no need for doc changes

=> ok to merge

@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_header_size branch from c4665c5 to 8dc6a50 Compare April 21, 2026 19:05
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_runtime_overflow branch from 621e760 to 00ef8f7 Compare April 21, 2026 19:05
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_header_size branch from 8dc6a50 to fcf6bff Compare April 22, 2026 11:47
Base automatically changed from duburcqa/split_llvm_adstack_header_size to main April 22, 2026 12:56
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_runtime_overflow branch from 00ef8f7 to cca0a2a Compare April 22, 2026 12:58
@duburcqa duburcqa merged commit af50758 into main Apr 22, 2026
47 checks passed
@duburcqa duburcqa deleted the duburcqa/split_llvm_adstack_runtime_overflow branch April 22, 2026 15:12