[AutoDiff] SNode-arm bound-expr capture rejects fold-attack gate indices #610

Merged
duburcqa merged 7 commits into main from duburcqa/snode_arm_fold_attack_validation
May 2, 2026

Conversation

@duburcqa
Contributor

@duburcqa duburcqa commented May 1, 2026

SNode-arm bound-expr capture rejects fold-attack gate indices

Follow-up to #599 (sparse adstack heap, merged). Hardens the SNode-arm capture path against fold-attack gate-index shapes that previously either silently corrupted gradients (LLVM CUDA / AMDGPU) or tripped the codegen-emitted overflow signal (SPIR-V Vulkan / Metal), and aligns multi-axis ndrange capture across LLVM and SPIR-V. No surface-API change.

TL;DR

The static-adstack analysis SNode arm of match_field_source now applies four structural checks before publishing a captured bound_expr:

  1. Iteration-count check - task_ir->end_value - task_ir->begin_value <= snode_iter_count (skipped when the loop bound is runtime-resolved).
  2. At-least-one-iterating-axis check - at least one component of the gate's LinearizeStmt::inputs (or, after lower_access, of the recovered floordiv / mod / sub axis components) transitively contains a LoopIndexStmt, with single-axis non-bare iterating shapes (field[i // 2], field[i % K], field[i + 5]) rejected.
  3. Distinct-axis value check - when there are two or more iterating axes, every iterating axis must hold a structurally distinct value (compared via irpass::analysis::same_value, not pointer identity). Pairwise same_value deduplication collapses CSE-fused field[i % 2, i % 2] and survives obfuscation attempts like (i % 2) + 0 - 0 paired with i % 2 that an attacker might use to defeat alg-simp / CSE; the canonical qd.ndrange(*shape) decomposition produces axes with structurally different values (i // K0, (i % K0) // K1, i % K1) even though every axis roots at the same LoopIndexStmt, so it captures uniformly across LLVM and SPIR-V backends.
  4. Joint-axis-product check - when no iterating axis is the task loop's bare LoopIndexStmt (which would make the joint mapping injective by itself), the product of per-axis value ranges must cover the loop trip count. Each axis's range is recovered by walking the lowered arithmetic for _ % K, _ // K, and the post-lower_access sub(L, mul(floordiv(L, K), K)) / sub(L, bit_shl(floordiv(L, K), log2(K))) shapes; an unrecognised shape contributes the parent's range conservatively. Catches selector[i % K0, (i // K0) % K1] against an oversized SNode where loop_iter > K0 * K1: every axis is value-distinct (so check 3 admits) and the SNode has spare cells (so check 1 admits), but the joint mapping wraps onto a K0 * K1-cell subspace.
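As a sanity model of what these checks guard (a pure-Python sketch, not the C++ analysis; lambdas stand in for the recovered IR axis expressions): a duplicate-value shape makes the joint iteration-to-cell mapping many-to-one, while the canonical ndrange decomposition stays bijective.

```python
def joint_indices(n, axes):
    """Map each loop iteration to the tuple of axis values it gates on."""
    return [tuple(f(i) for f in axes) for i in range(n)]

def is_injective(tuples):
    return len(set(tuples)) == len(tuples)

# Fold attack: field[i % 2, i % 2] -- both axes iterate, but share a value,
# so the joint mapping collapses 16 iterations onto 2 cells.
attack = joint_indices(16, [lambda i: i % 2, lambda i: i % 2])
assert not is_injective(attack)

# Canonical ndrange decomposition of a flat index over shape (4, 2, 2)
# (K0 = 4, K1 = 2, matching the i // K0, (i % K0) // K1, i % K1 form above):
# bijective over the trip count, so it is safe to capture.
K0, K1 = 4, 2
ndrange = joint_indices(16, [lambda i: i // K0,
                             lambda i: (i % K0) // K1,
                             lambda i: i % K1])
assert is_injective(ndrange)
```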

Gates that fail any check fall through to the dispatched-threads worst-case heap; legitimate gates (single- and multi-axis loops, ndrange-decomposed indices, kernel-arg slicing alongside iterating axes) capture as before.

Why

The SNode arm trusted whatever index expression the codegen passed to SNodeLookupStmt. Several fold-attack shapes slipped through and either undersized the float adstack heap (silent corruption on LLVM) or tripped the codegen-emitted overflow signal at sync time (hard error on SPIR-V):

| Shape | Mechanism | Caught by |
| --- | --- | --- |
| `selector[i % K]` with `K < n` | Loop iterates `n`, SNode has `K` cells; `n - K` excess gated iterations alias onto row `K-1`. | iteration-count check |
| `selector[42]` | Every iteration hits cell 42; reducer count is launch-constant (0 or 1), main pass claims `n` rows. | at-least-one-iterating-axis check |
| `selector[arg]` (no iterating axis) | Same as `selector[42]` but the constant slot is a kernel argument. | at-least-one-iterating-axis check |
| `selector[other_field[i]]` | Index is a runtime load, not derivable from any loop axis statically. | at-least-one-iterating-axis check |
| `selector[i % 2, i % 2]` | Two iterating axes share a value; the joint mapping is many-to-one and aliases iterations onto a few cells. | distinct-axis value check |
| `selector[i % K0, (i // K0) % K1]` with `loop_iter > K0 * K1` and oversized SNode | Axes are value-distinct but joint mapping wraps onto a `K0 * K1` subspace. | joint-axis-product check |
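The last shape can be reproduced numerically (a standalone sketch using the K0 = K1 = 8, loop_iter = 256 parameters of the bot-flagged test case):

```python
K0, K1, loop_iter = 8, 8, 256
hits = {}
for i in range(loop_iter):
    idx = (i % K0, (i // K0) % K1)   # each axis is value-distinct per iteration
    hits[idx] = hits.get(idx, 0) + 1

# The joint mapping still wraps: only 64 of the loop's 256 iterations land
# on unique cells, and every cell is aliased 4x.
assert len(hits) == K0 * K1
assert all(c == loop_iter // (K0 * K1) for c in hits.values())
```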

Separately, the canonical multi-axis qd.ndrange(*shape) shape (every ndrange axis is a floordiv / sub over the same LoopIndexStmt) was previously rejected on SPIR-V because the earlier distinct_iterating_sources rule walked back to root-equality and saw a single LoopIndexStmt source for N axes. Switching to value-equivalence on the axis statements admits the bijective ndrange decomposition uniformly while still rejecting the fold-attack shapes above.
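The rule switch can be modeled by comparing axis value traces instead of roots (a trace-based proxy for irpass::analysis::same_value; the real comparison is structural over the IR, not trace-based):

```python
def pairwise_value_distinct(axes, n):
    """Dedup axes by their value over the loop range: two axes collapse
    iff they agree on every iteration (stand-in for same_value)."""
    traces = [tuple(f(i) for i in range(n)) for f in axes]
    return len(set(traces)) == len(traces)

# All three ndrange axes root at the same loop index i, yet hold distinct
# values -- value-equivalence admits them where root-equality rejected them.
K0, K1 = 4, 2
ndrange_axes = [lambda i: i // K0, lambda i: (i % K0) // K1, lambda i: i % K1]
assert pairwise_value_distinct(ndrange_axes, 16)

# The fold-attack twin axes share one value and are still rejected.
attack_axes = [lambda i: i % 2, lambda i: i % 2]
assert not pairwise_value_distinct(attack_axes, 16)
```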

Surface API

Nothing changes for users. The opt-in ad_stack_experimental_enabled=True flag and the ad_stack_sparse_threshold_bytes knob remain identical. The doc update in docs/source/user_guide/autodiff.md adds an Appendix B that lists the gate-index shapes that capture vs fall back to the worst-case heap.

Mechanism

All checks live inside the SNode arm of match_field_source in quadrants/transforms/static_adstack_analysis.cpp - no pre-autodiff IR walk, no OffloadedStmt field plumbing, no codegen change. After the existing snode_descriptor_resolver lookup:

```cpp
const bool static_bound = task_ir->const_begin && task_ir->const_end && task_ir->end_stmt == nullptr;
const int64_t loop_iter = static_bound ? (task_ir->end_value - task_ir->begin_value) : 0;
if (static_bound && (loop_iter <= 0 || (uint64_t)loop_iter > (uint64_t)desc_opt->iter_count)) return false;

auto *lookup = getch->input_ptr->cast<SNodeLookupStmt>();
// Recover per-axis components from `LinearizeStmt::inputs` (StructFor path) or from the floordiv / mod / add /
// sub arithmetic tree (ndrange path expanded by `lower_access`); recurse through `BinaryOp` / `UnaryOp` looking
// for `LoopIndexStmt` (also accepting `AdStackLoadTopStmt` whose forward push carries a replayed loop index).
std::vector<Stmt *> distinct_iterating_axes;
int n_iterating = 0, n_bare_iterating = 0;
for (Stmt *axis : axes) {
  if (contains_loop_index(axis, 0)) {
    n_iterating++;
    if (axis->is<LoopIndexStmt>()) n_bare_iterating++;
    bool already_seen = false;
    for (Stmt *prev : distinct_iterating_axes) {
      if (prev == axis || irpass::analysis::same_value(prev, axis)) { already_seen = true; break; }
    }
    if (!already_seen) distinct_iterating_axes.push_back(axis);
  }
}
if (n_iterating == 0) return false;
if (n_iterating == 1 && n_bare_iterating == 0) return false;
if ((int)distinct_iterating_axes.size() < n_iterating) return false;

// Joint-axis-product check: walks each axis recursively to extract a value-range upper bound from `_ % K`,
// `_ // K`, and the post-`lower_access` `sub(L, mul/bit_shl(floordiv(L, K), K))` shapes. K is read directly
// from the `ConstStmt` rhs; an unrecognised shape contributes the parent's range conservatively. Skipped
// when any axis is the task loop's bare `LoopIndexStmt` (i alone identifies the iteration).
const bool any_task_loop_bare_index = std::any_of(axes.begin(), axes.end(), [&](Stmt *a) {
  auto *li = a->cast<LoopIndexStmt>(); return li && li->loop == task_ir;
});
if (static_bound && !any_task_loop_bare_index) {
  int64_t joint_product = 1;
  for (Stmt *axis : axes) {
    if (!contains_loop_index(axis, 0)) continue;
    joint_product = saturating_mul(joint_product, axis_max_range(axis, 0));
    if (joint_product >= loop_iter) break;
  }
  if (joint_product < loop_iter) return false;
}
```

Loop-invariant slice axes (ArgLoadStmt, ConstStmt) are accepted alongside iterating axes - the reducer over-counts by the slice factor (it walks all cells of the SNode, including unvisited slices), which is benign over-allocation. The pairwise same_value walk is O(n_axes^2) per gate and the joint-axis-product walk is O(n_axes * subtree_depth); both stay cheap because n_axes is bounded by SNode dimensionality (typically <= 5).
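A minimal model of the joint-axis-product walk (the semantics of saturating_mul and the per-axis range recovery are assumed from the comments above, not lifted from the source):

```python
import sys

def saturating_mul(a, b, cap=sys.maxsize):
    """Multiply non-negative ints, clamping at cap instead of overflowing."""
    return cap if b != 0 and a > cap // b else a * b

def joint_product_covers(loop_iter, axis_ranges):
    """Check 4: the product of per-axis value-range upper bounds must
    cover the loop trip count, else the joint mapping can wrap."""
    product = 1
    for r in axis_ranges:
        product = saturating_mul(product, r)
        if product >= loop_iter:
            return True
    return product >= loop_iter

# selector[i % 8, (i // 8) % 8]: each axis range recovers as 8.
assert not joint_product_covers(256, [8, 8])   # 64 < 256 -> fall back
assert joint_product_covers(64, [8, 8])        # exact cover -> capture
```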

Per-backend matrix

| Backend | Pre-PR | This PR |
| --- | --- | --- |
| CPU LLVM | Compound-index gates passed accidentally on small thread counts | All four checks gate capture; ndrange unchanged |
| CUDA / AMDGPU LLVM | Silent gradient corruption on fold attacks; ndrange captured | Falls back to worst-case heap on fold attacks; ndrange still captures via distinct-axis check |
| Vulkan / Metal SPIR-V | Hard overflow signal at sync on fold attacks; ndrange fell back to worst-case heap | Falls back to worst-case heap on fold attacks; ndrange now captures via distinct-axis + joint-axis-product checks (parity with LLVM) |

Genesis `test_differentiable_push[gpu]`: `mpm_grid_op_c65_0_reverse_grad_0_t00` (the canonical multi-axis ndrange shape, `for ii, jj, kk, ib in qd.ndrange(grid_res, B): if grid[f, ii, jj, kk, ib].mass > eps:`) keeps capturing under this PR on every backend; the kernel-arg `f` slice plus four iterating axes pass all four checks. Local Metal verification of the same shape (`repro_mpm_grid_arg_index_capture.py`) shows the capture switching from `src=worst_case_dispatched effective_rows=512 required_bytes=131072` to `src=reducer_count effective_rows=128 required_bytes=32768`.
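The reported figures are mutually consistent; a quick check (the per-row byte size is inferred from the numbers, not read from the source):

```python
worst_rows, worst_bytes = 512, 131072        # src=worst_case_dispatched
captured_rows, captured_bytes = 128, 32768   # src=reducer_count

per_row_bytes = worst_bytes // worst_rows
assert per_row_bytes == 256                           # same row size both ways
assert captured_rows * per_row_bytes == captured_bytes
assert worst_bytes // captured_bytes == 4             # 4x smaller float heap
```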

Tests

tests/python/test_adstack.py:

  • test_adstack_static_bound_expr_snode_gate_non_bijective_index_grad_correct (parametrized over compound_mod / affine_div / constant_index / dynamic_load_index / folding_two_axis_decomp) - pins gradient correctness for every fold-attack shape on every parallel-dispatched backend. The folding_two_axis_decomp parametrization is the bot-flagged shape selector[i % 8, (i // 8) % 8] against an (8, 8) SNode with loop_iter = 256 > 64.
  • test_adstack_static_bound_expr_snode_gate_bijective_*_grad_correct (split into linear_range, multi_axis_structfor, multi_axis_ndrange, slice_with_iter, decomposed_index) - asserts the canonical capture shapes still engage and the gradient remains numerically correct. The new decomposed_index test pins selector[i // K, i % K] from a flat range loop, the multi-axis split shape that same_value-based dedup unblocks across LLVM and SPIR-V.
  • All existing test_adstack_static_bound_expr_snode_gate_* tests pass unchanged.

Local Metal: 674 passed, 1 skipped, 7 xfailed (unrelated NaN / sizer-mutation issues, identical xfail set to base).

Side-effect audit

  • The SNode arm's accept rate is now stricter for fold-attack shapes (correct, just allocates more memory in the worst-case fallback) and looser for canonical multi-axis ndrange on SPIR-V (correct, this is the documented capture shape, gradient correctness verified by the bijective tests).
  • same_value is O(subtree_size) per pair and the axis-pair count is bounded by SNode dimensionality. The joint-axis-product walk is O(n_axes * subtree_depth). Both are flat additive costs on analyze_adstack_static_bounds.
  • The ndarray arm of match_field_source is unchanged.
  • No ABI / cache-key change: the StaticAdStackBoundExpr serialised fields are untouched.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b70fc4d147

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread quadrants/transforms/static_adstack_analysis.cpp Outdated
Comment thread quadrants/transforms/static_adstack_analysis.cpp Outdated
@duburcqa duburcqa force-pushed the duburcqa/snode_arm_fold_attack_validation branch 4 times, most recently from fa4a713 to 7daaa10 Compare May 1, 2026 22:01
@duburcqa duburcqa marked this pull request as draft May 1, 2026 22:03
Base automatically changed from duburcqa/sparse_adstack_heap to main May 1, 2026 22:09
@duburcqa duburcqa force-pushed the duburcqa/snode_arm_fold_attack_validation branch 5 times, most recently from 4deb0c2 to 85da727 Compare May 1, 2026 22:50
@duburcqa duburcqa marked this pull request as ready for review May 1, 2026 22:50
@duburcqa
Contributor Author

duburcqa commented May 1, 2026

@claude review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 85da727842


Comment thread quadrants/transforms/static_adstack_analysis.cpp
@duburcqa duburcqa force-pushed the duburcqa/snode_arm_fold_attack_validation branch from 85da727 to 4167d83 Compare May 1, 2026 23:04
@duburcqa duburcqa marked this pull request as draft May 1, 2026 23:15
@duburcqa duburcqa force-pushed the duburcqa/snode_arm_fold_attack_validation branch 2 times, most recently from 33bb554 to c240bc2 Compare May 1, 2026 23:29
@github-actions

github-actions Bot commented May 2, 2026

Coverage Report (c240bc212)

| File | Coverage | Missing |
| --- | --- | --- |
| 🟢 tests/python/test_adstack.py | 99% | 3390, 3403 |

Diff coverage: 99% · Overall: 74% · 191 lines, 2 missing

Full annotated report

@duburcqa duburcqa marked this pull request as ready for review May 2, 2026 06:00
@duburcqa duburcqa force-pushed the duburcqa/snode_arm_fold_attack_validation branch from 16982ee to 7767967 Compare May 2, 2026 06:02

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 16982eed0b


Comment thread quadrants/transforms/static_adstack_analysis.cpp
@duburcqa
Contributor Author

duburcqa commented May 2, 2026

@claude review

| `num_buffers` | Number of adstacks the kernel allocates - one per loop-carried variable plus one per dependent branch flag (see [One adstack per variable](#one-adstack-per-variable)). |

Kernels of the shape `for i in range(...): if field[i] cmp literal: <adstack work>` (a runtime gate directly above the adstack-using body, comparing one field entry to a constant) shrink further: the compiler counts gate-passing iterations at launch time and sizes the float adstack to that count instead of `num_threads * stack_size`. A workload whose gate matches 5% of iterations pays 5% of the float-adstack cost; the float heap grows on demand if a later launch matches more. Integer / boolean adstacks stay at `num_threads * stack_size` - their pushes fire unconditionally for control-flow replay. The shrinking is exact only when the gate's per-axis index is a bare loop variable (`field[i]`, `field[I, J, K]`); see [What can go wrong](#what-can-go-wrong) for a known limitation on `qd.field`-backed gates indexed by compound expressions.
The float heap is by far the main reverse-mode memory bottleneck because a typical kernel allocates many float-typed adstacks - one per floating-point loop-carried scalar, each storing both primal and adjoint - and the total scales as `num_threads * stack_size * num_float_buffers * 8` bytes, dominating the integer / boolean heap. Advanced static IR analysis is used to further shrink the float adstack in some common gated-kernel shapes: when a runtime gate sits directly above the adstack-using body and compares a single field entry to a constant, the compiler counts the gate-passing iterations at launch time and sizes the float adstack to that count, so a workload whose gate matches 5% of iterations pays 5% of the float-adstack cost. See [Appendix B: gate-index shapes that capture vs fall back to the worst-case heap](#appendix-b-gate-index-shapes-that-capture-vs-fall-back-to-the-worst-case-heap) for the authoritative list of supported shapes.
Collaborator

@hughperkins hughperkins May 2, 2026


  • "and adjoint - and the total " => "and adjoint. The total "
  • "kernel shapes: when a runtime" => "kernel shapes. When a runtime"
  • "to that count, so a workload " => "to that count. So a workload "

@hughperkins
Collaborator

checklist:

  • user-facing doc changes done
  • no major changes in hot files
  • in fact, no changes outside of autodiff feature files

=> ok to merge

@github-actions

github-actions Bot commented May 2, 2026

Coverage Report (42cd91320)

| File | Coverage | Missing |
| --- | --- | --- |
| 🟢 tests/python/test_adstack.py | 99% | 3392, 3409 |

Diff coverage: 99% · Overall: 74% · 227 lines, 2 missing

Full annotated report

@duburcqa duburcqa merged commit 4e06748 into main May 2, 2026
54 checks passed
@duburcqa duburcqa deleted the duburcqa/snode_arm_fold_attack_validation branch May 2, 2026 08:55