
[AutoDiff] Autodiff 8: Surface LLVM adstack push/pop overflow as a Python exception #535

Merged
duburcqa merged 1 commit into main from duburcqa/split_llvm_adstack_runtime_overflow
Apr 22, 2026

Conversation

@duburcqa
Contributor

@duburcqa duburcqa commented Apr 21, 2026

Surface LLVM adstack push/pop overflow as a Python exception

Replaces the pre-existing silent-wrong-gradient-on-overflow with a catchable Python exception at the next qd.sync(). Teardown path is kept safe so the poll does not throw into ~Program().

TL;DR

qd.init(arch=qd.cpu, ad_stack_experimental_enabled=True, ad_stack_size=32)
# ... user kernel whose reverse pass pushes more than 32 entries per adstack ...
with pytest.raises((AssertionError, RuntimeError), match=r"[Aa]dstack overflow"):
    compute.grad()
    qd.sync()

# Subsequent sync is clean; the flag resets after catch.
qd.sync()

The prior behaviour on LLVM: stack_push incremented n past max_num_elements with no bounds check, so overflow produced either out-of-bounds primal slots (if the alloca was ever indexed past its capacity) or plain overwrites of the last slot (if the runtime's stack_top_primal indexing stopped pointing at valid storage). Either way: silently wrong gradients with no error.

New behaviour: stack_push bounds-checks n + 1 > max_num_elements and, on overflow, sets runtime->adstack_overflow_flag = 1 (via __atomic_store_n(..., __ATOMIC_RELAXED) for multi-threaded CPU dispatch) and skips the increment instead of wrapping or trapping. stack_pop and stack_top_primal clamp at n == 0 so post-overflow code continues to execute without reading out of bounds. The flag surfaces the failure at the next qd.sync() via a new check_adstack_overflow() poll that throws QuadrantsAssertionError.
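The clamping contract above can be sketched as a minimal Python model (hypothetical class and names; the real implementation is the C++ runtime code shown further down):

```python
class AdStackModel:
    """Toy model of the bounds-checked adstack: push sets a flag and skips
    the increment on overflow; pop and top clamp so post-overflow code
    keeps executing in-bounds."""

    def __init__(self, max_num_elements):
        self.max_num_elements = max_num_elements
        self.n = 0                      # current element count (the u64 header)
        self.overflow_flag = False      # stands in for runtime->adstack_overflow_flag

    def push(self):
        # Bounds check: on overflow, raise the flag instead of wrapping or trapping.
        if self.n + 1 > self.max_num_elements:
            self.overflow_flag = True
            return
        self.n += 1

    def pop(self):
        # Clamp at 0 so over-popping stays safe.
        if self.n > 0:
            self.n -= 1

    def top_index(self):
        # n == 0 returns slot 0 (garbage, but in-bounds).
        return self.n - 1 if self.n > 0 else 0


stack = AdStackModel(max_num_elements=32)
for _ in range(66):                     # 66 pushes against capacity 32
    stack.push()
print(stack.n, stack.overflow_flag)     # n clamps at the capacity; the flag is set
```

The key property is that overflow is recorded but never trapped on: the kernel runs to completion and the host decides later whether to raise.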

Why

Silent-wrong-gradient is the worst failure mode: no error, no crash, just a wrong answer that propagates through downstream numerical code and may take days of optimizer misbehaviour to diagnose. The fix has to be loud at sync time, but it can't throw during process teardown (where std::terminate() replaces the Python exception with an abort). Both constraints shape the design.

Mechanism

Runtime-side flag

struct LLVMRuntime {
  // ...
  i64 error_code = 0;
  // Dedicated flag for adstack-overflow-specific errors. Separate from `error_code` so assertions
  // (which set error_code=1 and are only surfaced when `compile_config.debug` is on) do not leak
  // through the always-on poll that Program::synchronize runs.
  i64 adstack_overflow_flag = 0;
  // ...
};

A second i64 on the runtime struct, intentionally disjoint from error_code:

  • error_code is tied to assert_failed and only polled when compile_config.debug is on.
  • adstack_overflow_flag is polled unconditionally at every synchronize() because silent wrong gradients are too dangerous to gate on a debug flag.
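The two-flag polling policy can be modeled in a few lines of Python (a sketch only; `RuntimeModel` and this `synchronize` are hypothetical stand-ins for the C++ host logic):

```python
class RuntimeModel:
    def __init__(self):
        self.error_code = 0             # set by assert_failed
        self.adstack_overflow_flag = 0  # set by stack_push on overflow


def synchronize(runtime, debug):
    # error_code is only surfaced in debug builds...
    if debug and runtime.error_code:
        raise AssertionError("quadrants assertion failed")
    # ...but the adstack overflow poll is unconditional.
    if runtime.adstack_overflow_flag:
        runtime.adstack_overflow_flag = 0  # reset so the next sync is clean
        raise RuntimeError("Adstack overflow; raise ad_stack_size")


rt = RuntimeModel()
rt.adstack_overflow_flag = 1
try:
    synchronize(rt, debug=False)       # raises even with debug off
except RuntimeError as e:
    print("caught:", e)
synchronize(rt, debug=False)           # clean: the flag was consumed on catch
```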

stack_push / stack_pop / stack_top_primal

void stack_push(LLVMRuntime *runtime, Ptr stack, std::size_t max_num_elements, std::size_t element_size) {
  u64 &n = *(u64 *)stack;
  if (n + 1 > max_num_elements) {
    __atomic_store_n(&runtime->adstack_overflow_flag, (i64)1, __ATOMIC_RELAXED);
    return;  // skip the increment
  }
  n += 1;
  std::memset(stack_top_primal(stack, element_size), 0, element_size * 2);
}

void stack_pop(Ptr stack) {
  auto &n = *(u64 *)stack;
  if (n > 0) { n--; }  // clamp at 0 so post-overflow code keeps running safely
}

Ptr stack_top_primal(Ptr stack, std::size_t element_size) {
  auto n = *(u64 *)stack;
  std::size_t idx = n > 0 ? n - 1 : 0;  // n == 0 returns slot 0 (garbage, but in-bounds)
  return stack + sizeof(u64) + idx * 2 * element_size;
}

The __ATOMIC_RELAXED store is required because multiple CPU worker threads may race on the same flag (thread pool dispatch over a multi-element field). Plain unsynchronized stores from multiple threads to the same object are a data race even when all writers store the same value — the compiler is free to tear the write. Relaxed atomic store compiles to a regular naturally-aligned store on x86_64 / arm64 but satisfies the C++11 memory model. The host only reads the flag after the thread pool has joined, so no ordering beyond "happens eventually" is required.
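The read-after-join pattern is easy to demonstrate in Python (a hypothetical model; Python's GIL sidesteps the C++ data-race concern, so this only illustrates the dispatch shape, not the atomics):

```python
import threading

class Runtime:
    def __init__(self):
        self.adstack_overflow_flag = 0

def worker(runtime):
    # Several workers may race on the same flag; all store the same value.
    # In the C++ runtime this store is __atomic_store_n(..., __ATOMIC_RELAXED).
    runtime.adstack_overflow_flag = 1

rt = Runtime()
threads = [threading.Thread(target=worker, args=(rt,)) for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()                      # the thread pool has joined...
print(rt.adstack_overflow_flag)  # ...so the host read needs no extra ordering
```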

Host-side poll

LlvmRuntimeExecutor::check_adstack_overflow():

  • Reads and resets the flag via runtime_retrieve_and_reset_adstack_overflow(runtime) — a one-shot JIT call that swaps the flag to 0 with __atomic_exchange_n and writes the old value into result_buffer[quadrants_result_buffer_error_id].
  • If the old value is nonzero, throws QuadrantsAssertionError with a message pointing users at default_ad_stack_size.
  • Safe to call before materialize_runtime(): no-op when llvm_runtime_ or result_buffer_cache_ is null.
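The one-shot exchange semantics, modeled in Python (hypothetical names; the real call is a JIT'd __atomic_exchange_n on the runtime struct):

```python
class Runtime:
    def __init__(self):
        self.adstack_overflow_flag = 1   # pretend a kernel overflowed

def retrieve_and_reset_adstack_overflow(runtime):
    # Atomically-in-spirit: return the old value and zero the flag in one step.
    old, runtime.adstack_overflow_flag = runtime.adstack_overflow_flag, 0
    return old

rt = Runtime()
print(retrieve_and_reset_adstack_overflow(rt))  # 1: overflow observed once
print(retrieve_and_reset_adstack_overflow(rt))  # 0: already consumed
```

The exchange is what makes the flag self-resetting: the first sync after an overflow raises, and every later sync is clean unless a new overflow occurs.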

Teardown safety

LlvmProgramImpl::synchronize() calls check_adstack_overflow() unconditionally — except during teardown, where a throw would std::terminate() out of ~Program(). The teardown guard:

void LlvmProgramImpl::pre_finalize() override { finalizing_ = true; }
void LlvmProgramImpl::finalize() override { finalizing_ = true;  /* defensive re-assignment */ }
void LlvmProgramImpl::synchronize() override {
  runtime_exec_->synchronize();
  if (!finalizing_) { runtime_exec_->check_adstack_overflow(); }
}

Program::finalize() invokes program_impl_->pre_finalize() before the two teardown syncs, so finalizing_ is already true when those syncs run. The defensive re-assignment in finalize() is belt-and-suspenders: moving the assignment into finalize() alone (dropping pre_finalize()) would silently re-introduce the std::terminate() teardown bug (the header's field-level comment spells this out).
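The guard's behaviour can be sketched as a hypothetical Python model of the C++ snippet above:

```python
class ProgramModel:
    def __init__(self):
        self.finalizing = False
        self.overflow_flag = 1           # pretend an unsynced overflow is pending

    def pre_finalize(self):
        self.finalizing = True

    def synchronize(self):
        # Poll the flag on every sync, except once finalization has begun.
        if not self.finalizing and self.overflow_flag:
            raise RuntimeError("Adstack overflow")

prog = ProgramModel()
prog.pre_finalize()
prog.synchronize()                       # no raise: the teardown sync is safe
print("teardown sync completed without raising")
```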

Codegen-side plumbing

TaskCodeGenLLVM::visit(AdStackPushStmt) now threads get_runtime() as the first arg to stack_push, matching the new signature. Everything else in the codegen is unchanged — push/pop/top still take the opaque Ptr and don't need to know about the flag.

Program-level hook

Program::finalize() invokes a new program_impl_->pre_finalize() hook before its two teardown syncs. A matching no-op virtual void pre_finalize() {} is added to ProgramImpl so non-LLVM backends are unaffected.

Internal sanity test

internal_functions.h::test_stack() is extended to exercise the new contract: push past capacity, verify n stays at max_num_elements and the flag flips; over-pop past n == 0, verify n stays at 0 and stack_top_primal clamps at slot 0; reset the flag so subsequent tests in the same fixture are not poisoned. Also plugs the pre-existing new u8[132] leak with a matching delete[] stack; before return 0;.

Tests

Five new tests in tests/python/test_adstack.py, all using a shared _overflowing_compute(n_elements, n_iter=64) helper (64 iterations + 2 setup pushes = 66 pushes, comfortably above every test's ad_stack_size=32):

  1. test_adstack_overflow_raises — overflow surfaces as (AssertionError, RuntimeError) at qd.sync(), not silently.
  2. test_adstack_overflow_flag_resets_after_catch — after the first raise, a subsequent qd.sync() with no new grad launch returns normally.
  3. test_adstack_large_capacity_resolves_overflow — raising ad_stack_size to 1024 lets the same kernel run to completion with the correct gradient (pins the remediation path the error message points users at).
  4. test_adstack_overflow_multithreaded — 16-element field so several threads overflow in parallel. Asserts the raise is still clean (no deadlock, no crash, one exception regardless of how many threads flipped the bit).
  5. test_adstack_overflow_during_teardown_does_not_abort — runs the overflowing grad in a child process and intentionally never calls qd.sync(). Child must exit with returncode 0 (the pre_finalize guard suppresses the poll during teardown), not SIGABRT.
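The child-process pattern behind test 5 looks roughly like this (a generic stand-in, not the real test — the script body here only mimics "set the flag, never sync, exit during teardown"):

```python
import subprocess
import sys

# A child process does work that would raise at the next sync, never syncs,
# and must still exit 0 because the poll is suppressed during teardown.
child_script = r"""
import sys
# Stand-in for "run the overflowing grad, never call qd.sync()": the flag is
# set but nothing polls it before interpreter teardown.
adstack_overflow_flag = 1
sys.exit(0)   # teardown must not abort
"""
proc = subprocess.run([sys.executable, "-c", child_script])
print(proc.returncode)  # 0, not a signal exit such as -SIGABRT
```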

Side-effect audit

| Concern | Where checked | Verdict |
| --- | --- | --- |
| Non-debug runs | Poll is not gated on compile_config.debug | Intentional — silent wrong gradients must not be debug-gated. |
| Multi-thread race on flag | __atomic_store_n(..., __ATOMIC_RELAXED) | Avoids a C++11 data race; compiles to a regular store on all supported arches. |
| Teardown path | pre_finalize() flips finalizing_ before the teardown syncs | Covered by test_adstack_overflow_during_teardown_does_not_abort. |
| Post-overflow continuation | pop clamps at 0, top_primal clamps at slot 0 | Host raises on the flag before any post-overflow read reaches user code. |
| Header field name typo | Program::result_buffer (no trailing underscore) | Corrected in the result_buffer_cache_ comment after claude bot flagged it. |
| Pre-existing test_stack memory leak | delete[] stack; added | Fixed opportunistically since we're rewriting the function anyway. |

Stack

Autodiff 8 of 13. Second commit of the "LLVM adstack safety" triplet split. Based on #534 (header size). Followed by #495 (codegen budget guard).


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d1a9c46601


Comment thread quadrants/runtime/llvm/runtime_module/runtime.cpp
Comment thread quadrants/runtime/llvm/runtime_module/runtime.cpp
Comment thread tests/python/test_adstack.py
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_header_size branch from 1caca8f to 4c5ee88 Compare April 21, 2026 07:19
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_runtime_overflow branch from 97533d1 to adf1dc1 Compare April 21, 2026 07:19
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_header_size branch from 4c5ee88 to 11d2006 Compare April 21, 2026 08:18
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_runtime_overflow branch from adf1dc1 to baabb85 Compare April 21, 2026 08:18
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_header_size branch from 11d2006 to 903e8e6 Compare April 21, 2026 09:50
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_runtime_overflow branch from baabb85 to 727e1a4 Compare April 21, 2026 09:51
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_header_size branch from 903e8e6 to a222483 Compare April 21, 2026 12:03
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_runtime_overflow branch from 727e1a4 to c9751cf Compare April 21, 2026 12:03
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_header_size branch from a222483 to c4665c5 Compare April 21, 2026 14:42
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_runtime_overflow branch from c9751cf to 621e760 Compare April 21, 2026 14:42
@hughperkins
Collaborator

Checklist:

  • tightening of error handling => doesn't change existing API or usage
    • no need for doc changes

=> ok to merge

@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_header_size branch from c4665c5 to 8dc6a50 Compare April 21, 2026 19:05
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_runtime_overflow branch from 621e760 to 00ef8f7 Compare April 21, 2026 19:05
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_header_size branch from 8dc6a50 to fcf6bff Compare April 22, 2026 11:47
Base automatically changed from duburcqa/split_llvm_adstack_header_size to main April 22, 2026 12:56
@duburcqa duburcqa force-pushed the duburcqa/split_llvm_adstack_runtime_overflow branch from 00ef8f7 to cca0a2a Compare April 22, 2026 12:58
@duburcqa duburcqa merged commit af50758 into main Apr 22, 2026
47 checks passed
@duburcqa duburcqa deleted the duburcqa/split_llvm_adstack_runtime_overflow branch April 22, 2026 15:12