
[Lang] Add qd.precise(...) for per-op IEEE-strict FP.#476

Open
duburcqa wants to merge 41 commits into experimental from duburcqa/qd_precise

Conversation

@duburcqa
Contributor

@duburcqa duburcqa commented Apr 13, 2026

qd.precise(...): per-op IEEE-strict FP for compensated arithmetic

Two commits, no behavior change for users who don't opt in. Adds a single-purpose primitive that lets a user keep fast_math=True globally for performance while protecting individual floating-point expressions from reassociation, contraction, fast-math approximation, and algebraic folds.

TL;DR

@qd.func
def fast_two_sum(a, b):
    s = qd.precise(a + b)            # IEEE-strict add
    e = qd.precise(b - (s - a))      # IEEE-strict sub chain — would fold to 0 under fast-math
    return s, e

Mirrors the precise keyword in MSL/HLSL. Compensated-arithmetic kernels (Dekker / Kahan 2Sum, Veltkamp split, double-single accumulators) now have a clean way to opt out of fast-math at the granularity of a single expression instead of having to flip the global fast_math=False switch and pay the perf cost everywhere else.

Why

Quadrants compiles kernels with fast_math=True by default. Under that mode every backend's compiler is free to:

  • reassociate FP ops ((a + b) + c → a + (b + c))
  • contract mul-then-add into FMA
  • substitute approximations for sqrt, sin, cos, log, 1/x (varies by backend)
  • algebraically simplify (0 * a → 0, a / a → 1, a - a → 0, a == a → true)

This silently destroys compensated-arithmetic primitives whose entire correctness is built on the fact that (a - aa) + (b - bb) is non-zero in IEEE arithmetic. The traditional workaround — globally flipping fast_math=False — is too coarse: it costs perf across every kernel even when only a handful of lines need IEEE semantics.

qd.precise(expr) is the per-op opt-in.

Surface API

qd.precise(obj)

Recursively tags every BinaryOpExpression and UnaryOpExpression in obj's subtree with a precise flag. Recursion descends through:

  • BinaryOpExpression(lhs, rhs) — tags itself, recurses into both operands.
  • UnaryOpExpression(operand) — tags itself, recurses into the operand (covers cast, bit_cast, neg, sqrt, sin, cos, log, exp, rsqrt, inv, ...).
  • TernaryOpExpression(op1, op2, op3) — recurses into all three (no precise field on it; just keeps inner ops reachable through qd.select(...) wrappers).

Stops at any other Expression kind: loads, constants, ndarray accesses, qd.func calls, etc. Semantics inside a qd.func body are governed by that body's own ops — wrap calls separately if needed.

Tagging is in-place on the underlying shared Expression nodes, so if you alias a subexpression and then wrap one alias, both uses get IEEE semantics (intentional — the tag is a property of the value, not the use site).
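
A minimal sketch of that aliasing behavior in frontend C++ (illustrative only; it assumes the usual operator overloads on Expr, and precise() is the free function described below):

// Illustrative frontend C++, not code from this PR:
Expr t = a + b;   // one underlying BinaryOpExpression node
Expr u = t * c;   // the multiply aliases that same node as its lhs
precise(t);       // tags the shared (a + b) node in place
// u's multiply is still fast-math, but its (a + b) operand is now
// IEEE-strict in both uses: the tag lives on the value, not the use site.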

The Python entry point is in python/quadrants/lang/ops.py; bound to C++ via _qd_core.precise in quadrants/python/export_lang.cpp; backed by Expr precise(const Expr&) in quadrants/ir/expr.cpp.

Mechanism end-to-end

1. AST + IR — one bool field, two layers

  • Frontend AST (quadrants/ir/frontend_ir.h): BinaryOpExpression::precise, UnaryOpExpression::precise
  • Backend IR (quadrants/ir/statements.h): BinaryOpStmt::precise, UnaryOpStmt::precise

Both IR fields are added to QD_STMT_DEF_FIELDS(...) so:

  • clone() (via copy-ctor) carries them automatically.
  • same_statements IR-equality picks them up via field_manager.equal().
  • WholeKernelCSE therefore won't merge a precise op with a non-precise twin.

Propagation from AST to IR happens in BinaryOpExpression::flatten() and UnaryOpExpression::flatten() (quadrants/ir/frontend_ir.cpp) — set stmt->precise = expr->precise on the freshly-built BinaryOpStmt / UnaryOpStmt before ctx->push_back.
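
In code, the binary half of that propagation is a single extra assignment (the unary case is the same one-liner on the freshly-built UnaryOpStmt):

// BinaryOpExpression::flatten() -- carry the AST tag onto the backend stmt
auto bin_stmt = std::make_unique<BinaryOpStmt>(type, lhs_stmt, rhs_stmt,
                                               /*is_bit_vectorized=*/false, dbg_info);
bin_stmt->precise = precise;
ctx->push_back(std::move(bin_stmt));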

2. Walker — precise() free function

quadrants/ir/expr.cpp::precise(const Expr&). A worklist walk (not recursive, to keep stack depth bounded for deep AST chains common in scientific code) that descends through Binary/Unary/Ternary nodes and sets precise = true on Binary/Unary along the way.
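
A sketch of the walk's shape (member and helper names follow the descriptions above; not the verbatim code):

// quadrants/ir/expr.cpp -- illustrative shape of the worklist walk
std::vector<Expr> stack{expr};
while (!stack.empty()) {
  Expr cur = stack.back();
  stack.pop_back();
  if (auto bin = cur.cast<BinaryOpExpression>()) {
    bin->precise = true;
    stack.push_back(bin->lhs);
    stack.push_back(bin->rhs);
  } else if (auto un = cur.cast<UnaryOpExpression>()) {
    un->precise = true;
    stack.push_back(un->operand);
  } else if (auto tern = cur.cast<TernaryOpExpression>()) {
    stack.push_back(tern->op1);  // no precise field on the ternary itself
    stack.push_back(tern->op2);
    stack.push_back(tern->op3);
  }
  // any other Expression kind is a leaf for this walk: stop descending
}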

3. Transforms — gate FP folds on !stmt->precise

  • quadrants/transforms/alg_simp.cpp: 0 * a → 0, a / a → 1, a / const → a * (1/const), a - a → 0, a -^ a → 0, a == a → 1, a != a → 0
  • quadrants/transforms/binary_op_simplify.cpp: try_rearranging_const_rhs and the rest of the FP simplification body

Each previously-fast-math-gated branch now also checks && !stmt->precise. The integer paths are untouched (precise has no meaning for integer ops; the field is ignored for them everywhere).
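
Concretely, each gate now has this shape (sketch of the alg_simp.cpp pattern quoted in the review below; the surrounding pass plumbing is elided):

// alg_simp.cpp -- fold a - a → 0 only when fast-math applies and the op is
// not tagged precise; integer subtraction keeps folding unconditionally.
if ((fast_math && !stmt->precise) ||
    is_integral(stmt->ret_type.get_element_type())) {
  // ... replace stmt with a zero constant ...
}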

4. Codegen — three backends, two mechanisms

LLVM (CPU / CUDA / AMDGPU) — per-instruction FMF clear

At the bottom of TaskCodeGenLLVM::visit(BinaryOpStmt*) and visit(UnaryOpStmt*) (quadrants/codegen/llvm/codegen_llvm.cpp):

if (stmt->precise) {
  if (auto *inst = llvm::dyn_cast<llvm::Instruction>(llvm_val[stmt])) {
    if (llvm::isa<llvm::FPMathOperator>(inst)) {
      inst->setHasAllowReassoc(false);
      inst->setHasNoNaNs(false);
      inst->setHasNoInfs(false);
      inst->setHasNoSignedZeros(false);
      inst->setHasAllowReciprocal(false);
      inst->setHasAllowContract(false);
      inst->setHasApproxFunc(false);
    }
  }
}

The IRBuilder default-stamps fast-math flags on every FP instruction it creates (afn nnan ninf nsz reassoc). For precise ops we zero them out individually after construction.

Note: Instruction::setFastMathFlags(FastMathFlags{}) only OR's flags in on the LLVM version we ship — hence the explicit per-flag clears.

This handles all FP intrinsics: llvm.sqrt.f32, FAdd/FSub/FMul/FDiv, CreateMaxNum/MinNum, FCmp comparisons, FMA contractions.

SPIR-V (Metal / Vulkan, including MoltenVK) — per-op NoContraction decoration

quadrants/codegen/spirv/spirv_ir_builder.{h,cpp} — added a bool precise = false parameter to add/sub/mul/div/mod and a maybe_no_contraction(Value v, bool precise) helper that emits OpDecorate ... NoContraction when precise is true.

quadrants/codegen/spirv/spirv_codegen.cpp::visit(BinaryOpStmt*) and visit(UnaryOpStmt*) thread bin->precise / stmt->precise into the appropriate call sites.

SPIRV-Cross translates NoContraction into MSL's precise qualifier when it generates MSL — which is what MoltenVK uses internally on macOS. So this single decoration mechanism covers both native Vulkan drivers and the MoltenVK-on-Metal path.

CUDA libdevice — explicit fast/precise selection

CUDA is the only backend with explicit fast-vs-precise function-call selection that bypasses LLVM FMF. In quadrants/codegen/cuda/codegen_cuda.cpp, qd.sin/cos/log for f32 are emitted as direct calls to either __nv_fast_sinf (~few ULP, degrades sharply outside (-π/2, π/2)) or __nv_sinf (correctly rounded). LLVM FMF on the call instruction is meaningless because it's a plain function call; the function name itself encodes the approximation level.

Patched all three sites:

const bool use_fast = compile_config.fast_math && !stmt->precise;
llvm_val[stmt] = call(use_fast ? "__nv_fast_logf" : "__nv_logf", input);
// (also for sin and cos)

Other backends are unaffected by this class of issue:

  • CPU lowers qd.sin/cos/log/exp to libc calls (sinf/...) which are correctly rounded by default; qd.sqrt lowers to llvm.sqrt.f32 (handled by the FMF clear).
  • AMDGPU only emits __ocml_* (no __ocml_native_* fast variants in the codegen).
  • Metal/Vulkan unary ops go through SPIR-V OpExtInst GLSL.std.450 Sin/Cos/Log/.... The NoContraction decoration is emitted but per the SPIR-V spec it scopes to arithmetic instructions, so most consumers (including SPIRV-Cross's MSL path) will ignore it on ExtInst results. On those paths correctness relies on the driver's default (non-fast-math) stdlib, which is typically within 1-2 ULP for in-range inputs. The decoration is kept as best-effort future-proofing.

5. Offline cache key — content-addressed, must include the new field

quadrants/analysis/gen_offline_cache_key.cpp — the AST-level cache key generator emits expr->precise for both BinaryOpExpression and UnaryOpExpression. Without this, two structurally-identical kernels (one with qd.precise(...) wrappers, one without) would hash to the same key, and the second to compile would silently reuse the first's compiled artifact. The merged binary test catches this regression directly.

Note: the addition is unconditional — kernels that don't use qd.precise(...) will get a different cache hash than they had on previous builds, so the on-disk cache invalidates once on upgrade. This is intentional; we don't want backward cache compatibility for a mode that affects correctness.
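
The fix itself is a one-liner per visitor (sketch; emit(...) stands in for the key generator's existing serialization hook):

// gen_offline_cache_key.cpp -- in both the BinaryOpExpression and
// UnaryOpExpression visitors, fold the new field into the content hash:
emit(expr->precise);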

Per-backend coverage matrix

  • CPU — reassoc/contraction/folds: LLVM FMF clear ✓; approx transcendentals: N/A (libc sinf is correctly rounded)
  • CUDA — reassoc/contraction/folds: LLVM FMF clear ✓; approx transcendentals: codegen_cuda.cpp libdevice selection ✓
  • AMDGPU — reassoc/contraction/folds: LLVM FMF clear ✓; approx transcendentals: N/A (__ocml_* already correctly rounded)
  • Vulkan (native + MoltenVK) — reassoc/contraction/folds: SPIR-V NoContraction decoration ✓; approx transcendentals: decoration emitted but best-effort only on ExtInst; driver default stdlib (typ. 1-2 ULP)
  • Metal — reassoc/contraction/folds: SPIR-V NoContraction → MSL precise; approx transcendentals: same as Vulkan

Tests — tests/python/test_precise.py

Two tests, both run on every available backend.

test_qd_precise_protects_fast_math — observable on all backends

Defines a Dekker 2Sum accumulator (1000 × 1e-8 added into 1.0) twice in the same test (a plain-C++ sketch of the arithmetic follows the list):

  • df_accum_naive — unprotected; under fast_math=True the compensation term lo collapses to zero (this is the bug qd.precise exists to fix; asserted as the negative control).
  • df_accum_precise — every FP op wrapped in qd.precise(...); asserts lo is non-trivially non-zero AND that hi + lo beats the naïve f32 sum by >1000×.
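
For reference, the arithmetic being protected, written in standalone C++ (IEEE-strict when compiled without fast-math; nothing Quadrants-specific):

#include <cstdio>

int main() {
  float hi = 1.0f, lo = 0.0f;
  for (int i = 0; i < 1000; ++i) {
    const float b = 1e-8f;
    float s = hi + b;        // rounds back to 1.0f: 1e-8 < ulp(1.0f)/2
    float e = b - (s - hi);  // Fast2Sum error term; a fast-math compiler folds this to 0
    hi = s;
    lo += e;                 // compensation accumulates to ~1e-5
  }
  std::printf("hi = %.9g, lo = %.9g\n", hi, lo);  // a naive f32 sum stays at 1.0
  return 0;
}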

The same test also implicitly validates the offline-cache key fix: the two kernels are structurally identical apart from the qd.precise(...) wrappers, so if gen_offline_cache_key.cpp didn't account for the precise field, the second compile would silently reuse the first's artifact and the precise version would behave like the naive one — a comment in the test spells this out so future readers don't accidentally split the kernels back into separate tests and lose the coverage.

test_qd_precise_unary_rounding — verifies the unary path on every backend

Computes qd.precise(qd.sin/cos/log(x)) for inputs [0.5, 1.5, 2.5, 4.0, 7.0, 10.0, 25.0, 50.0] (chosen to span both the central range AND values where some backends' fast-math approximations are known to degrade). Asserts the result is within 2 ULP of the correctly-rounded f32 reference computed via numpy in f64 then narrowed.

The naive (non-precise) variant is deliberately not part of this test — on most backends fast_math=True already produces correctly-rounded transcendentals, so a naive-vs-precise comparison would be uninformative. The test instead asks the right question directly: "is qd.precise(unary) correctly rounded?"
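
For readers who want the tolerance check spelled out, here is a plain-C++ equivalent of what the test does via numpy (a sketch, shown for sin; ulp_distance and sin_within_2_ulp are illustrative helpers, not test code):

#include <cmath>
#include <cstdint>
#include <cstring>

// ULP distance for finite f32 values of the same sign: the bit patterns of
// same-sign IEEE floats are monotonically ordered as integers.
static std::uint32_t ulp_distance(float a, float b) {
  std::uint32_t ua, ub;
  std::memcpy(&ua, &a, sizeof ua);
  std::memcpy(&ub, &b, sizeof ub);
  return ua > ub ? ua - ub : ub - ua;
}

// Reference computed in f64 and then narrowed, as the test does in numpy.
bool sin_within_2_ulp(float x, float computed) {
  const float ref = static_cast<float>(std::sin(static_cast<double>(x)));
  return ulp_distance(computed, ref) <= 2;
}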

CUDA validation needed. I don't have CUDA hardware to validate this locally. The expected behavior on CUDA: this test should pass with both commits applied, and should fail if you revert just the codegen_cuda.cpp chunk (or the whole unary commit) — because __nv_fast_sinf/cosf/logf exceed the 2 ULP tolerance for several of the chosen inputs. If it passes on CUDA regardless, something in my reasoning is off and we should dig in.

Side-effect audit (what can't go sideways)

I went through every place in the codebase where the IR gets hashed, compared, serialized, or content-deduped. Findings:

  • Offline cache key (C++) — gen_offline_cache_key.cpp: was broken for both Binary and Unary; fixed.
  • Python fastcache — python/quadrants/lang/_fast_caching/src_hasher.py: hashes Python source only; naturally distinguishes qd.precise(...) calls.
  • Stmt clone / serialization — QD_STMT_DEF_FIELDS(..., precise) + auto copy-ctor: auto-covered.
  • IR equality (same_statements) — field_manager.equal() sees precise via QD_STMT_DEF_FIELDS: auto-covered.
  • Whole-kernel CSE — transforms/whole_kernel_cse.cpp falls back to same_statements for equality: auto-covered.
  • Codegen, non-precise path — all call sites default precise=false, gates everywhere: pure no-op for non-precise; no perf regression.
  • IR printer / dumper — transforms/ir_printer.cpp, expression_printer.h: doesn't emit precise; pure debug visibility, not load-bearing.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1eae2b4bcb


Comment thread quadrants/transforms/alg_simp.cpp
Comment thread quadrants/ir/expr.cpp
@duburcqa force-pushed the duburcqa/qd_precise branch from 1eae2b4 to 8f02070 on April 13, 2026 at 14:36
@hughperkins
Collaborator

Could we add some user-facing doc please?

Contributor Author

@duburcqa duburcqa left a comment


Second pass — most of the review is inline. Overall the design (per-op IEEE-strict tag, backend-specific lowering) lands cleanly; my concerns are a narrow correctness gap in try_rearranging_const_rhs, a likely-overselling claim about SPIR-V NoContraction covering transcendentals, and a stale header comment. Plus a fragile negative-control assertion in the tests. Details inline.

Comment thread quadrants/transforms/binary_op_simplify.cpp
Comment thread quadrants/codegen/spirv/spirv_codegen.cpp
Comment thread quadrants/ir/expr.h Outdated
Comment thread tests/python/test_precise.py Outdated
Comment thread tests/python/test_precise.py
Comment thread python/quadrants/lang/ops.py
Comment thread python/quadrants/lang/ops.py Outdated
Comment thread quadrants/codegen/cuda/codegen_cuda.cpp Outdated
@hughperkins
Collaborator

PR description is AMAZING 🔥

Contributor Author

@duburcqa duburcqa left a comment


Second pass (from scratch). Five new findings, all lower priority than the first round. Most concern alg_simp.cpp rewrites that aren't gated on precise and therefore can bypass the IEEE-strict intent. Details inline.

Comment thread quadrants/transforms/alg_simp.cpp
Comment thread quadrants/transforms/alg_simp.cpp
Comment thread tests/python/test_precise.py
Comment thread tests/python/test_precise.py
Comment thread quadrants/codegen/cuda/codegen_cuda.cpp Outdated
@hughperkins
Collaborator

hughperkins commented Apr 13, 2026

Once you've added user-facing doc, please could you work your way through the AI prompts at https://www.notion.so/Meta-Quadrants-code-reviews-in-the-presence-of-AI-coding-33e0c06194e18080920dd87681976146?source=copy_link#3400c06194e180fe98d1e835a77d4246 , pasting the resulting conversation into this PR.

Here are the questions (excluding a question about comment deletion, and one about comments being 120-character aligned):

  • to what extent do the tests match what is described in the doc?
  • is there any code that is duplicated by or in this PR that we could refactor better somehow?
  • Is there any code or code change in this PR that might be embarrassing in some way if published in a public github repo?
  • Are there any layout inconsistencies in this code relative to itself and to other existing repo code?
  • Are there any coding inconsistencies in this code relative to itself and to other existing repo code?
  • Is there anything in this PR that could be implemented in less code?
  • Is there anything in this PR that could be implemented in a way that will run faster at runtime?
  • Is there anything in this PR which might exhibit incorrect behavior in some way?
  • check the test coverage of the tests. Is there any code that is being added or modified in this PR, that is not being exercised by the tests? Is there any functionality that is implied by consistency or otherwise that is not being tested by these tests?
  • check the factorization of the tests. are there some tests that could be merged, e.g. making use of parametrize? reminder: you can use if qd.static(...) in quadrants kernels to switch between different compiled code.

I've been doing this by hand, but it takes a while. So, if you don't want to do it by hand, two alternatives I'm ok-ish with (but haven't tried yet):

  1. giving AI all questions at once, and telling it to spawn a subagent for each question, or
  2. writing the questions to a file, telling the AI to read one question at a time from the file, answer it fully, and only then address the next question

In all cases, totally fine to ask the AI to write its answer to a .md file, or similar, so it's convenient to copy and paste. (Though I've just been doing it by hand so far, I've been thinking doing it by hand might not be the most efficient approach...)

Notes:

  • require using model Opus 4.6.
  • you can skip the last two questions, about comments being deleted and comments not being 120-character wrapped, since 1. I'm sure that's not an issue in your PRs, 2. you are the main person who cares about these two things I feel :)

@hughperkins
Collaborator

(I should probably add a question like:

  • is there anything in this PR that could impact runtime performance
    ... but your code only affects compile-speed I think?)

@duburcqa force-pushed the duburcqa/qd_precise branch from 556c072 to 2669dc5 on April 13, 2026 at 15:18
Contributor Author

@duburcqa duburcqa left a comment


Third pass. Prior-round items are all addressed in commits 5438801 and 2669dc5 (inner-precise bail in try_rearranging_const_rhs, stale header comment, SPIR-V transcendental claim walked back, negative-control dropped, __all__ position, signed-zero gate, pow gate, sqrt in unary test, CUDA use_fast hoist). Only residual I see is three non-ASCII chars that slipped through the earlier scrubbing. Details inline.

Comment thread quadrants/codegen/llvm/codegen_llvm.cpp Outdated
Comment thread quadrants/codegen/spirv/spirv_ir_builder.h Outdated
Comment thread quadrants/ir/expr.cpp Outdated
@duburcqa
Contributor Author

  1. To what extent do the tests match what is described in the doc?

Tests cover the primary doc-claimed contracts:

  • Binary FP op protection (Dekker 2Sum, test_qd_precise_protects_fast_math)
  • Unary transcendentals within 2 ULP (test_qd_precise_unary_rounding, covers sin/cos/log/sqrt)
  • Cache-key distinction (indirectly, via the two-kernel structure)

Documented-but-untested behaviors:

  • TernaryOp (qd.select) recursion
  • Walker stopping at qd.func call sites (wrap-at-caller is no-op)
  • Alias sharing ("tag travels with the value")
  • Vector/Matrix ValueError
  • cast / bit_cast tagging specifically
  • Idempotency when fast_math=False

Strong on primary use case, thin on edge cases covered in the "Where the walker stops" and "Caveats" sections.

  2. Any code duplicated that could be refactored better?

One clear candidate: the LLVM FMF-clear block in codegen_llvm.cpp is duplicated at the end of visit(UnaryOpStmt*) (~17 lines) and visit(BinaryOpStmt*) (~17 lines). Could be a ~10-line helper:

static void clear_fast_math_if_precise(bool precise, llvm::Value *v) {
  if (!precise) return;
  auto *inst = llvm::dyn_cast<llvm::Instruction>(v);
  if (!inst || !llvm::isa<llvm::FPMathOperator>(inst)) return;
  inst->setHasAllowReassoc(false);
  inst->setHasNoNaNs(false);
  inst->setHasNoInfs(false);
  inst->setHasNoSignedZeros(false);
  inst->setHasAllowReciprocal(false);
  inst->setHasAllowContract(false);
  inst->setHasApproxFunc(false);
}

Saves ~14 lines net. Other patterns: the alg_simp (fast_math && !stmt->precise) recurs ~6 times but factoring would hurt readability; the SPIR-V DEFINE_BUILDER_BINARY_* macros are already factored.

  3. Anything embarrassing if published?

Nothing significant. The one overstated claim (SPIR-V NoContraction covering transcendentals) was caught in review and walked back in both the code comment at spirv_codegen.cpp:888-895 and the PR description's coverage matrix. Code is honest about the best-effort nature on that path.

  1. Layout inconsistencies vs repo?

None spotted. test_precise.py follows the existing @test_utils.test + kernel + assertion style. precise.md matches static.md in structure (title, intro, sections). Field placement in BinaryOpStmt / UnaryOpStmt and inclusion in QD_STMT_DEF_FIELDS matches existing convention. __all__ in ops.py isn't strictly alphabetical anywhere; my entry sits next to cast, which is loosely grouped.

  5. Coding inconsistencies vs repo?

None significant. bool precise{false} default-init matches existing IR field style. maybe_no_contraction helper in SPIR-V builder uses the same this->decorate(...) pattern as nearby code. LLVM per-flag clear matches how setHas*(false) is used elsewhere (there's no unified utility in the upstream LLVM we ship).

  1. Anything implementable in less code?

Modestly yes: the LLVM FMF-clear refactor in Q2 saves ~14 lines. The walker is as compact as it can be. SPIR-V builder plumbing is bound by macro symmetry (adding optional precise to 5 existing binary ops). Not much further reducible without losing clarity.

  7. Anything that could run faster at runtime?

qd.precise is a compile-time primitive with zero runtime cost on un-wrapped paths. Wrapped paths are slower by design (FMF-cleared LLVM ops, NoContraction, non-fast libdevice). Nothing to chase at runtime.

Compile time: O(n) walker over expression AST, two extra emit(bool) calls in cache-key generation. Negligible.

  8. Anything that might exhibit incorrect behavior?

One known best-effort limitation (documented): qd.precise(qd.sin/cos/log(x)) on SPIR-V backends. The SPIR-V spec scopes NoContraction to arithmetic instructions; most consumers ignore it on the OpExtInst calls used for transcendentals. Correctness on those paths relies on the driver's default (non-fast-math) stdlib being accurate enough. Acknowledged in spirv_codegen.cpp:888-895, the precise.md backend table, and the PR description.

Otherwise: alg_simp gates, try_rearranging_const_rhs bail, pow branch guard, LLVM FMF clear, and CUDA libdevice opt-out collectively cover identified paths. Clone / IR-equality / CSE / cache-key all auto-carry precise via QD_STMT_DEF_FIELDS.

  9. Test coverage gaps?

(Expanded from Q1.) Code added but not directly tested:

  • Walker recursion through TernaryOpExpression (code present in expr.cpp:73, no test)
  • UnaryOpExpression::flatten() propagation exercised only via sin/cos/log/sqrt (implicit)
  • alg_simp a + 0 / 0 + a gate (observed via Dekker, not directly)
  • pow branch precise-bail (not tested at all)
  • Offline cache-key emit for precise (observed via the Dekker pass/fail, not directly)
  • is_quadrants_class rejection path in Python

Biggest gaps: pow branch and the Python rejection are entirely untested.

  10. Factorization / parametrize?

The two tests are fundamentally different contracts, so merging them doesn't fit. Within test_qd_precise_unary_rounding, sin/cos/log/sqrt could be split with pytest.parametrize but bundling in one kernel is more compile-efficient.

test_qd_precise_protects_fast_math runs two kernels specifically to exercise the cache-key distinction; it could be reformulated as:

@qd.kernel
def df_accum(in_arr, out, use_precise: qd.Template):
    ...
    if qd.static(use_precise):
        s, e = two_sum_precise(hi, in_arr[i])
    else:
        s, e = two_sum_naive(hi, in_arr[i])

Same compile outcome (two kernel variants under the hood), marginally cleaner. Not a big win; current structure is defensible.

  1. Runtime performance impact?
  • Un-wrapped code: zero. precise field defaults false; every codegen gate is if precise: apply slower variant, so non-precise paths emit identical code to pre-PR.
  • Wrapped code: slower by design (FMF-cleared, NoContraction, non-fast libdevice).
  • Compile time: extra bool field, two extra cache-key hashes, O(n) walker. Negligible.
  • One-time on-disk cache invalidation on upgrade because the cache-key format changed unconditionally. Documented; intentional.

@duburcqa
Copy link
Copy Markdown
Contributor Author

duburcqa commented Apr 13, 2026

Docs: added docs/source/user_guide/precise.md (linked from the user guide TOC under Core concepts, next to static.md). Covers what qd.precise does, walker scope rules, interaction with fast_math, a per-backend coverage table (honest about the SPIR-V transcendental caveat), and a Dekker 2Sum worked example.

Follow-ups arising from the review questions below (in c5fbab6):

  • Refactored the duplicated LLVM FMF-clear block in codegen_llvm.cpp into a disable_fast_math(llvm::Value*) helper (dedupes visit(UnaryOpStmt*) and visit(BinaryOpStmt*); the if (stmt->precise) check stays explicit at the call sites).
  • Added test_qd_precise_rejects_quadrants_classes covering the Vector/Matrix ValueError that was previously documented but untested.
  • Added test_qd_precise_recurses_through_select covering the TernaryOp walker path. Uses a signed-zero observation (qd.precise(qd.select(c, x[0] + 0.0, 0.0)) with x[0] = -0.0 yields +0.0, while the unwrapped version yields -0.0) to verify the walker tags the inner add through qd.select; a runtime load defeats ConstantFold. A standalone C++ illustration of the trick follows.
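
For reference, the signed-zero trick in standalone C++ (pure IEEE-754 semantics when compiled without fast-math; nothing Quadrants-specific):

#include <cmath>
#include <cstdio>

int main() {
  float x = -0.0f;
  float folded = x;         // what an `a + 0 → a` fold produces: keeps the sign, -0.0
  float strict = x + 0.0f;  // IEEE round-to-nearest: (-0.0) + (+0.0) == +0.0
  std::printf("folded signbit=%d, strict signbit=%d\n",
              (int)std::signbit(folded), (int)std::signbit(strict));  // 1 vs 0
  return 0;
}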

@duburcqa
Contributor Author

@claude review

Comment thread quadrants/transforms/alg_simp.cpp
Comment thread quadrants/transforms/alg_simp.cpp
@duburcqa
Contributor Author

I was assisted by Claude Opus in writing this PR. I have read and reviewed every change in this PR. I take full responsibility for the lines added and removed in this PR. I won't blame any issue on Claude Opus. I have reviewed this PR, and approve it.

Comment thread docs/source/user_guide/precise.md Outdated
@duburcqa force-pushed the duburcqa/qd_precise branch from c327965 to b229ca5 on April 13, 2026 at 16:46
Comment thread docs/source/user_guide/precise.md
@hughperkins
Collaborator

From your earlier questions to AI, please could we add the missing tests that were identified:

TernaryOp (qd.select) recursion
Walker stopping at qd.func call sites (wrap-at-caller is no-op)
Alias sharing ("tag travels with the value")
Vector/Matrix ValueError
cast / bit_cast tagging specifically
Idempotency when fast_math=False

Comment thread quadrants/codegen/llvm/codegen_llvm.cpp
Comment thread docs/source/user_guide/precise.md
Comment thread python/quadrants/lang/ops.py
@hughperkins
Collaborator

hughperkins commented Apr 13, 2026

Responses from Opus for the AI questions above, using one agent per question:

Note: I am NOT requesting these changes. I'm posting them so we have the information. I'm going to read these changes after posting, and request any changes from this post that I feel make sense to me.

To what extent do the tests match what is described in the doc?


What the doc says qd.precise does

The doc describes qd.precise(expr) as a per-expression opt-in to IEEE-strict floating-point semantics. It prevents reassociation, FMA contraction, approximate-transcendental substitution, and algebraic simplification (e.g. a - a -> 0, a / a -> 1), regardless of the module-level fast_math setting. The doc covers:

  1. Core behavior: protect compensated-arithmetic patterns (Dekker 2Sum) from fast-math folding
  2. Walker tagging: every BinaryOp and UnaryOp in the subtree is tagged
  3. Recursive descent: walker descends through BinaryOp, UnaryOp, and TernaryOp (e.g. qd.select)
  4. Composite expressions: wrapping qd.precise(qd.sqrt(a * a + b * b)) tags all three inner FP ops
  5. Walker stop points: loads, constants, qd.func call sites, atomic ops
  6. Interaction with fast_math=False: precise is redundant but harmless
  7. Backend coverage: CPU/CUDA/AMDGPU (LLVM FMF), Vulkan/Metal (SPIR-V NoContraction)
  8. Scalar-only caveat: Vector/Matrix raises
  9. Aliasing caveat: tag is a property of the expression value, not the use site; aliasing shares the tag

What IS tested

  • Core purpose: prevent algebraic folding under fast_math=True — test_qd_precise_protects_fast_math: full Dekker 2Sum accumulator; asserts the compensation term is non-zero and the compensated sum beats the naive f32 sum by >1000x.
  • UnaryOp tagging (sin, cos, log, sqrt) — test_qd_precise_unary_rounding: verifies results are within 2 ULP of the correctly-rounded f32 reference with fast_math=True.
  • Walker descends through TernaryOp (qd.select) — test_qd_precise_recurses_through_select: exploits signed-zero semantics: (-0.0) + 0.0 must yield +0.0 under IEEE, so the inner add must be tagged.
  • Scalar-only caveat: Vector/Matrix raises — test_qd_precise_rejects_quadrants_classes: pytest.raises(ValueError) for both Vector and Matrix.
  • Offline-cache key distinguishes precise from non-precise — test_qd_precise_protects_fast_math (indirect): runs the naive kernel first; if the cache key didn't account for precise, the precise kernel would silently reuse the naive artifact.

What is NOT tested

  • Composite unary+binary recursion (qd.precise(qd.sqrt(a * a + b * b))): the doc's "What gets protected" section explicitly shows the walker descending through a UnaryOp (sqrt) to tag inner BinaryOps (mul, add). No test exercises this path. test_qd_precise_unary_rounding wraps individual unary ops over a single load (qd.sin(x[i])), so the walker never encounters an inner BinaryOp beneath a UnaryOp.
  • Walker stops at loads: no test verifies the walker does not descend into ndarray indexing or field access.
  • Walker stops at constants: not tested.
  • Walker stops at qd.func call sites: the doc explicitly warns that wrapping at the call site does not protect ops inside the callee. No test verifies this boundary.
  • Walker stops at atomic ops: not tested.
  • fast_math=False + qd.precise: the doc's interaction table says precise is "redundant but harmless" when fast_math=False. No test runs with fast_math=False to confirm this.
  • FMA contraction prevention: the doc lists FMA contraction as a distinct optimization that qd.precise defeats. No test isolates this (the 2Sum test may indirectly exercise it, but doesn't assert on it).
  • Reassociation prevention: listed as a distinct optimization. No test isolates (a + b) + c vs a + (b + c) reordering.
  • a / a -> 1 algebraic fold: the doc lists this explicitly. Not tested (a - a -> 0 is covered indirectly by 2Sum).
  • Aliasing caveat: the doc says "if you alias a subexpression and then wrap one alias, both uses get IEEE semantics." No test verifies this shared-tag behavior.
  • Other unary ops (neg, exp, rsqrt, casts, bit_cast): only sin/cos/log/sqrt are tested. The doc's list is longer.
  • Other binary ops (%, comparisons, bit ops on FP types): only +, -, * are exercised (via 2Sum).
  • Backend-specific behavior: tests are backend-agnostic by design (run on whatever backend is available). No test targets the SPIR-V NoContraction decoration, CUDA libdevice fast/non-fast variant selection, or the best-effort transcendental caveat on Vulkan/Metal. This is arguably appropriate for a Python-level test suite, but the CUDA libdevice path change (the diff shows __nv_fast_sinf vs __nv_sinf selection) has no dedicated coverage.

Tests that go BEYOND the doc

  1. Offline-cache key validation (test_qd_precise_protects_fast_math): The doc never mentions cache key generation. The test deliberately runs the naive kernel first to smoke-test that the cache key serializer (changed in gen_offline_cache_key.cpp) correctly differentiates precise from non-precise ops.

  2. Signed-zero semantics (test_qd_precise_recurses_through_select): The doc does not discuss signed zeros. The test uses -0.0 + 0.0 yielding +0.0 as the observable signal that alg_simp's a + 0 -> a fold was blocked — a clever technique not mentioned in the doc.

  3. pow precise-bail note: The test file includes a trailing comment explaining why pow under qd.precise is deliberately untested (the rewrites are IEEE-equivalent, so no observable difference). The doc does not mention pow at all, but the implementation (visible in the diff's alg_simp.cpp changes) does gate pow rewrites on !stmt->precise.

Summary

The tests cover the primary use case (compensated arithmetic surviving fast_math=True) thoroughly and with a well-designed end-to-end workload. They also cover the unary-op path, the TernaryOp recursion, and the scalar-only error. The main gaps are around walker boundary conditions (stops at loads, constants, qd.func calls, atomics), the composite unary-wrapping-binary recursion the doc highlights, the aliasing caveat, the fast_math=False interaction, and individual algebraic-simplification rules beyond a - a -> 0. These are mostly "negative" or "boundary" tests rather than core-functionality tests, but several correspond to explicitly documented behaviors that a reader would expect to see exercised.

Is there any code that is duplicated or could be refactored?

Overall the PR is well-structured. Most patterns exist for good reason (the precise flag genuinely needs to be threaded through four separate layers: Expression → Stmt → transform → codegen). That said, there are a few concrete refactoring opportunities:

1. Repeated (fast_math && !stmt->precise) || is_integral(...) guard — alg_simp.cpp

The expression (fast_math && !stmt->precise) || is_integral(stmt->ret_type.get_element_type()) (or a close variant using stmt->lhs->ret_type) appears five times in quadrants/transforms/alg_simp.cpp:

  • Line 348 (0 * a → 0)
  • Line 402 (a / a → 1)
  • Line 408 (standalone fast_math && !stmt->precise && ... for reciprocal rewrite)
  • Line 464 (a - a → 0)
  • Line 513 (a == a → 1)

Before this PR, the pattern was already fast_math || is_integral(...) repeated in the same spots; the PR mechanically inserted && !stmt->precise into each one. This is a good candidate for a small inline helper:

// True when the compiler may apply fast-math-style algebraic folds to `stmt`.
bool may_fold(BinaryOpStmt *stmt) const {
  return (fast_math && !stmt->precise) || is_integral(stmt->ret_type.get_element_type());
}

This would remove five (soon more, as new simplifications are added) repetitions of the same three-part condition and make future additions less error-prone.

2. Identical if (stmt->precise) { disable_fast_math(...); } blocks — codegen_llvm.cpp

quadrants/codegen/llvm/codegen_llvm.cpp has the same two-line block in two places (lines 496–498 for UnaryOpStmt and lines 776–778 for BinaryOpStmt):

if (stmt->precise) {
  disable_fast_math(llvm_val[stmt]);
}

This is minor (two occurrences, three lines each), but if TernaryOpStmt ever gains a precise flag — or if the post-processing grows (e.g. adding metadata) — a small template helper or lambda would keep things DRY:

auto maybe_clear_fmf = [&](auto *s) {
  if (s->precise) disable_fast_math(llvm_val[s]);
};

Not critical today but worth noting.

3. Near-identical bool precise{false} field + comment across four IR classes

The bool precise{false}; field is added to four classes, each with a near-identical two-line doc comment ("Mirrors MSL/HLSL precise"):

  • UnaryOpExpression (frontend_ir.h:375–377)
  • BinaryOpExpression (frontend_ir.h:401–403)
  • UnaryOpStmt (statements.h:158–160)
  • BinaryOpStmt (statements.h:254–256)

The comments are subtly inconsistent (the UnaryOpStmt version says "no contraction, no approximate implementations" while BinaryOpStmt says "no reassociation, no contraction, no algebraic folds"). This isn't a code-level refactoring opportunity per se (C++ doesn't have a great mechanism for sharing a field across unrelated class hierarchies), but the comment wording should be unified. A single canonical comment could be defined next to the precise() function in expr.h and the field-level comments shortened to a cross-reference.

4. precise propagation in flatten() — parallel boilerplate

The propagation from Expression → Stmt happens via identical one-liners in two flatten() methods:

  • UnaryOpExpression::flatten() at frontend_ir.cpp:264: unary->precise = precise;
  • BinaryOpExpression::flatten() at frontend_ir.cpp:434: bin_stmt->precise = precise;

This is inherent to the current flatten-per-class design and is not independently refactorable without reworking the flatten infrastructure. Noting for completeness only — no action needed.

5. Test helpers: two_sum_naive / two_sum_precise and fast_two_sum_naive / fast_two_sum_precise

In tests/python/test_precise.py (lines 11–59), the test defines four @qd.func helpers that come in naive/precise pairs:

  • two_sum_naive vs two_sum_precise — identical structure, differ only in qd.precise(...) wrappers
  • fast_two_sum_naive vs fast_two_sum_precise — same

The duplication is intentional (the test needs both variants to compare behavior), so this is fine as-is. A parameterized factory could reduce lines but would hurt test readability. No action recommended.

6. SPIR-V maybe_no_contraction indirection

spirv_ir_builder.h declares maybe_no_contraction(Value v, bool precise) and spirv_ir_builder.cpp:678 defines it as:

void IRBuilder::maybe_no_contraction(Value v, bool precise) {
  if (precise) {
    this->decorate(spv::OpDecorate, v, spv::DecorationNoContraction);
  }
}

This is then called in:

  • DEFINE_BUILDER_BINARY_USIGN_OP macro (line 692)
  • DEFINE_BUILDER_BINARY_SIGN_OP macro (line 706)
  • IRBuilder::mod() (line 725)
  • spirv_codegen.cpp:896 (unary op visitor)

The helper is well-factored. No issue.

Summary

  • Medium — repeated (fast_math && !stmt->precise) || is_integral(...) guard (alg_simp.cpp ×5): extract a may_fold() helper.
  • Low — identical if (stmt->precise) disable_fast_math(...) blocks (codegen_llvm.cpp ×2): extract a lambda or template helper.
  • Low — inconsistent precise field comments (frontend_ir.h, statements.h ×4): unify wording.
  • None — test helper duplication (test_precise.py): intentional; no action.
  • None — flatten() one-liners (frontend_ir.cpp ×2): structural; no action.

Is there any code that might be embarrassing if published publicly?

Overall verdict: No. This is a clean, professional PR. There are no secrets, offensive content, debug leftovers, or placeholder values. A few minor cosmetic items are worth noting:

1. Missing newline at end of precise.md

The new documentation file docs/source/user_guide/precise.md is missing a trailing newline (the diff shows \ No newline at end of file on line 130). This is a trivial lint issue but will show up in many editors and CI linters.

2. PR perpetuates the ARTHIMATIC typo

In quadrants/codegen/spirv/spirv_codegen.cpp, the macro BINARY_OP_TO_SPIRV_ARTHIMATIC is a pre-existing misspelling of "ARITHMETIC". This PR modifies the macro body (adding the bin->precise argument) but keeps the misspelled name. While not introduced by this PR, touching the macro without fixing the name is a missed opportunity and the typo will continue to be visible in the codebase.

3. precise inserted out of order in __all__

In python/quadrants/lang/ops.py, "precise" is inserted between "cast" and "ceil" (line 1564). The surrounding entries in that section of __all__ are roughly alphabetical (bit_cast, bit_shr, cast, ceil, cos, exp, ...), so precise should appear later, after pow. The test_api.py placement is correct (between pow and profiler), making the inconsistency more visible.

4. Empty if body as a control-flow guard

In quadrants/transforms/alg_simp.cpp, the pow precise-bail is written as:

if (stmt->precise) {
    // Preserve the user's `pow()` call verbatim. The helpers below rewrite into sqrt/mul/div chains
    // whose synthesized stmts inherit `precise=false`, stripping the IEEE-strict tag.
} else if (exponent_one_optimize(stmt)) {

An empty if body with only a comment is a recognized pattern, and the comment explains the intent well, but some style guides flag it. A return; or early-continue would be more explicit.

Items explicitly NOT found

  • No hardcoded secrets, API keys, or passwords.
  • No offensive or inappropriate language in comments or variable names.
  • No TODO/HACK/FIXME comments introduced by this PR (the ones visible in the diff context, e.g. // FIXME: figure out why OpSRem does not work and // TODO: remove this field, are pre-existing).
  • No placeholder/dummy values that shouldn't ship.
  • No debug printf/print/std::cerr statements left in.
  • No copyright issues.
  • No misleading comments — the comments are accurate, thorough, and match the code behavior.
  • Documentation is high quality with correct examples and a detailed backend-coverage table.
Are there any layout inconsistencies?

Overall the PR is well-formatted: C++ uses 2-space indentation with attached braces, Python uses 4-space indentation, and comment style is consistent with existing conventions. The .clang-format 120-column limit is respected throughout. That said, there are a few specific inconsistencies:

1. __all__ ordering in python/quadrants/lang/ops.py (line 1564)

"precise" is inserted between "cast" and "ceil", breaking the alphabetical ordering that the surrounding entries follow:

    "bit_cast",
    "bit_shr",
    "cast",
    "precise",
    "ceil",
    "cos",

Alphabetically, "precise" belongs after "pow" (near line 1584). By contrast, tests/python/test_api.py correctly places it between "pow" and "profiler":

    "pow",
    "precise",
    "profiler",

The existing __all__ list isn't perfectly sorted (the tail — "max", "min", "select", "abs", "pow" — is out of order), but inserting "precise" right after "cast" in the middle of an alphabetically-sorted run is visually jarring and clearly unintentional.

2. Missing trailing newline in docs/source/user_guide/precise.md

The diff shows \ No newline at end of file at the end of the new documentation file. Every other .md file in docs/source/user_guide/ ends with a newline. This violates POSIX text-file conventions and will produce a git diff warning.

3. Inconsistent QD_NOT_IMPLEMENTED reformatting in quadrants/codegen/spirv/spirv_codegen.cpp

The PR reformats the one-liner in the unary visit() to a proper multi-line block:

  else {
    QD_NOT_IMPLEMENTED
  }

But the structurally identical pattern in the binary visit() is left as a single line:

  else {QD_NOT_IMPLEMENTED} ir_->register_value(bin_name, bin_value);

This creates an internal inconsistency in the same file. If the unary path was worth reformatting (which it was — AllowShortBlocksOnASingleLine: false in .clang-format), the binary path should get the same treatment.

4. No other layout issues found

The following were checked and are consistent:

  • C++ indentation: 2-space indent throughout, matching .clang-format IndentWidth: 2. No tabs.
  • Brace placement: Attached braces (BreakBeforeBraces: Attach) used consistently in all new code.
  • C++ comment style: // line comments with 2 spaces before trailing comments, matching SpacesBeforeTrailingComments: 2.
  • C++ line lengths: All new/changed lines are within the 120-column limit.
  • Python indentation: 4-space indent throughout, PEP 8 compliant.
  • Python docstring format: The precise() docstring in ops.py uses the same Google-style format with Args:, Returns:, Notes:, and Example:: sections as the neighboring cast() and bit_cast() functions.
  • Python import ordering: test_precise.py follows the same import numpy / import pytest / import quadrants / from tests import test_utils ordering as other test files (e.g., test_cast.py).
  • Include ordering: No new includes are added.
  • Macro formatting: The DEFINE_BUILDER_BINARY_USIGN_OP / DEFINE_BUILDER_BINARY_SIGN_OP macros in spirv_ir_builder.cpp use aligned backslash continuation, consistent with the previous version and with neighboring macros like DEFINE_BUILDER_CMP_OP.
  • Blank line conventions: New function definitions are separated by blank lines; if (stmt->precise) blocks are preceded by a blank line for visual separation, matching the existing pattern of adding breathing room before guard clauses.
  • Header field ordering: The bool precise{false} fields in frontend_ir.h and statements.h are placed after the existing fields and before constructors, consistent with the existing cast_type field placement in UnaryOpExpression.
Are there any coding inconsistencies?

Overall the PR is well-written and follows existing patterns closely. A few inconsistencies remain, ranging from minor style nits to potential correctness concerns:

1. __all__ ordering: "precise" breaks alphabetical sort

The __all__ list in ops.py is alphabetically sorted (bit_cast, bit_shr, cast, ceil, cos, …). The PR inserts "precise" immediately after "cast" (line 1564), violating the sort order — it should appear after "pow" and before "profiler" (or wherever p-entries belong). This mirrors the alphabetical intent of the existing list but was not respected.

2. Stmt::make<> vs Stmt::make_typed<> inconsistency within the same file

In alg_simp.cpp, the PR changes one Stmt::make<BinaryOpStmt> to Stmt::make_typed<BinaryOpStmt> for the 2 * a → a + a rewrite (line 377) so that it can set sum->precise. However, every other synthesized BinaryOpStmt in the same file still uses Stmt::make<BinaryOpStmt> (lines 224, 233, 270, 272, 362, 416, 429). The mix of make<> and make_typed<> for the same type within a single file is a style inconsistency. Either is fine, but the PR should pick the one already dominant in the file (make<>), or convert them all to make_typed<>.

3. Synthesized stmts in alg_simp.cpp don't propagate precise — inconsistent treatment

The 2 * a → a + a rewrite correctly copies stmt->precise to the new sum stmt. But several other rewrites in the same file create new BinaryOpStmt / UnaryOpStmt nodes that inherit precise=false by default:

  • a ** 0.5 → sqrt(a) (line 186): creates UnaryOpStmt(sqrt) without copying precise.
  • a ** n → exponentiation by squaring (lines 224, 233): creates mul stmts without precise.
  • a ** -n → 1 / a**n (lines 268–272): creates neg, pow, div stmts without precise.
  • a / const → a * (1/const) (line 416): creates mul without precise.

The PR handles this for pow by bailing out entirely when stmt->precise is set (the if (stmt->precise) { ... } guard at line 470), which is correct for pow. But the a / const → a * (1/const) rewrite is guarded only by fast_math && !stmt->precise at a higher level (line 408), and the sqrt and exponentiation-by-squaring rewrites are not behind a precise guard at all. If a user wrote qd.precise(x ** 0.5), the exponent_half_optimize path would fire and produce an untagged sqrt, silently dropping the precise intent. This is a functional gap, not just a style issue.

4. UnaryOpStmt::same_operation() does not account for precise

same_operation() (in statements.cpp, line 25) only compares op_type and cast_type. Two UnaryOpStmts that differ only in precise will be reported as the "same operation." This could cause CSE or other passes that use same_operation() to merge a precise op with a non-precise one. The QD_STMT_DEF_FIELDS macro does include precise, so the field_manager.equal() path in same_statements.cpp is correct, but any call site relying on same_operation() alone could be a problem.

5. TernaryOpExpression and TernaryOpStmt lack a precise field

The walker in expr.cpp descends through TernaryOpExpression and tags its children, but never sets a precise flag on the TernaryOpExpression itself. TernaryOpStmt likewise has no precise field. This is internally consistent (select/ifte are not FP arithmetic), but it means the offline cache key generator (gen_offline_cache_key.cpp) does not emit precise for ternary nodes — which is fine today since they don't carry the flag, but creates an asymmetry: if a ternary's children are tagged precise, the ternary's cache key stays the same, and correctness relies entirely on the children's cache keys distinguishing themselves. This is correct but worth documenting as intentional.

6. The BINARY_OP_TO_SPIRV_ARTHIMATIC macro name preserves a misspelling

The macro is spelled ARTHIMATIC (should be ARITHMETIC). This is a pre-existing issue, but the PR widens the macro's signature (adding the bin->precise argument) without fixing the typo, which could have been a low-cost cleanup. Minor, but an opportunity missed.

7. Python-side: no @quadrants_scope decorator on precise()

bit_cast() and cast() in ops.py do not use @quadrants_scope either, so precise() is consistent with those immediate neighbors. However, many other ops in the file do use the decorator. Since precise() only makes sense inside a kernel/func body (it calls _qd_core.precise which requires an active IR context), the lack of the decorator means the error message for out-of-scope usage will be a cryptic C++ exception rather than a clean Python-level error. This matches cast/bit_cast behavior, so it is not a new inconsistency, but it is a missed opportunity to be consistent with the majority of the file.

8. Error type: ValueError vs TypeError

precise() raises ValueError("Cannot apply precise on Quadrants classes"). The existing bit_cast() does the same (ValueError). However, passing an object of the wrong type is arguably a TypeError, not a ValueError. This is consistent with bit_cast but inconsistent with Python conventions. Minor.

9. Docstring in ops.py says "every binary FP op" but code also tags unary ops

The docstring for precise() says "Every binary FP op inside obj is evaluated in source order…" and the Returns section says "with every reachable binary op tagged as precise". But the C++ implementation (and the longer body of the docstring) also tags UnaryOpExpression nodes. The summary line and Returns section are inaccurate — they should say "binary and unary FP ops."

Could anything be implemented in less code?

1. Extract (fast_math && !stmt->precise) into a helper — 5 call sites in alg_simp.cpp

The condition (fast_math && !stmt->precise) is repeated verbatim 5 times in quadrants/transforms/alg_simp.cpp (lines 348, 402, 408, 464, 513). Four of those also share the larger compound pattern ((fast_math && !stmt->precise) || is_integral(...)). A one-line helper on AlgSimp would collapse all of them:

bool may_fold_fp(const BinaryOpStmt *stmt) const {
  return fast_math && !stmt->precise;
}

Then the four compound sites become (may_fold_fp(stmt) || is_integral(stmt->ret_type.get_element_type())), and the standalone site (line 408) becomes just may_fold_fp(stmt). This eliminates a repeated two-operand boolean expression that reviewers have to mentally re-parse each time, and ensures a future change (e.g. adding a third condition) only needs to touch one place.

2. Fold the if (stmt->precise) guard into disable_fast_math — duplicated in codegen_llvm.cpp

The identical three-line block appears twice, once at the end of visit(UnaryOpStmt *) (line 496) and once at the end of visit(BinaryOpStmt *) (line 776):

if (stmt->precise) {
  disable_fast_math(llvm_val[stmt]);
}

The SPIR-V backend already does this the right way: maybe_no_contraction(v, precise) absorbs the bool check internally. Apply the same pattern to the LLVM side — make disable_fast_math accept the precise bool and early-return when false:

void maybe_disable_fast_math(llvm::Value *v, bool precise) {
  if (!precise) return;
  auto *inst = llvm::dyn_cast<llvm::Instruction>(v);
  // ... clear flags ...
}

Each call site collapses to a single line: maybe_disable_fast_math(llvm_val[stmt], stmt->precise);. This also makes the LLVM and SPIR-V backends consistent in style.

3. Redundant double-check of precise in SPIR-V unary codegen

In quadrants/codegen/spirv/spirv_codegen.cpp (around line 893):

if (stmt->precise && is_real(stmt->element_type())) {
    ir_->maybe_no_contraction(val, /*precise=*/true);
}

The call passes a literal true to maybe_no_contraction, which then internally checks if (precise). The inner check is guaranteed to be true. Either:

  • Pass stmt->precise instead of true and drop the outer stmt->precise && guard (let the function handle it), or
  • Call ir_->decorate(...) directly since we already know precise is true.

Either option removes a layer of redundancy.

4. Empty if body for pow precise bail — alg_simp.cpp

The pow-handling block (around line 470) uses an empty if body as a no-op guard:

if (stmt->precise) {
    // Preserve the user's `pow()` call verbatim.
} else if (exponent_one_optimize(stmt)) {

An empty if body that exists only to skip the else chain is unconventional and easy to misread. A simple negated guard is shorter and clearer:

if (!stmt->precise) {
    if (exponent_one_optimize(stmt)) {
        // a ** 1 -> a
    } else if (exponent_zero_optimize(stmt)) {
    // ...
    }
}

5. BinaryOpExpression::flatten — minor: use make_typed for inline field set

In quadrants/ir/frontend_ir.cpp (around line 432), the PR expands a one-liner into three statements to set precise:

auto bin_stmt = std::make_unique<BinaryOpStmt>(type, lhs_stmt, rhs_stmt, /*is_bit_vectorized=*/false, dbg_info);
bin_stmt->precise = precise;
ctx->push_back(std::move(bin_stmt));

This is fine, but if the BinaryOpStmt constructor accepted an optional precise parameter (as was done for the SPIR-V builder's add/sub/mul/div/mod functions), the flatten code would stay a one-liner. The same applies to UnaryOpStmt in the preceding UnaryOpExpression::flatten. This is a minor point — the current approach is defensible if keeping the constructor signature stable is preferred.

Summary

The biggest win is item #1: extracting the repeated fast_math && !stmt->precise guard into a helper. It touches 5 sites in alg_simp.cpp alone and would eliminate the most pervasive pattern of duplicated logic introduced by this PR. Items #2 and #3 are smaller but improve consistency between the LLVM and SPIR-V backends.

Could anything run faster at runtime?

TL;DR

The non-precise hot path is unaffected — generated code is identical to before. But one compile-time decision in alg_simp.cpp leaves significant runtime performance on the table for qd.precise(x ** n), and a few compile-time micro-inefficiencies could be tightened.


1. pow precise bail-out leaves runtime performance on the table (alg_simp.cpp:469–472)

This is the biggest issue. The current code unconditionally bails out of all exponent rewrites when precise is set:

if (stmt->precise) {
    // Preserve the user's `pow()` call verbatim.
} else if (exponent_one_optimize(stmt)) {

The test file's own comment acknowledges the rewrites (a**1 → a, a**0.5 → sqrt(a), a**n → mul chain) are IEEE-equivalent to the original pow(). The stated reason for bailing is that the synthesized BinaryOpStmt/UnaryOpStmt nodes inherit precise=false, stripping the tag. But that's easily fixable: propagate stmt->precise onto each synthesized node (the same way the 2*a → a+a rewrite already does at line 382). Then the user gets the fast mul-chain and keeps IEEE-strict semantics on every op.

As written, qd.precise(x ** 4) emits a pow(x, 4) libcall/intrinsic at runtime instead of three multiplies — that is an order-of-magnitude slower on every backend (GPU pow is ~50 cycles vs. ~12 cycles for three FMULs). Users of compensated arithmetic do use small-integer powers, so this is likely to bite.

Recommendation: Propagate precise onto synthesized stmts in exponent_n_optimize, exponent_half_optimize, exponent_negative_optimize (as is already done for the alg_is_two → add rewrite), and remove the blanket bail-out.


2. Non-precise runtime path: no regression

For every codegen backend, the precise check only fires when stmt->precise == true:

  • LLVM: disable_fast_math() is behind if (stmt->precise) — non-precise ops never call it, so the emitted LLVM IR has the same fast-math flags as before.
  • CUDA: const bool use_fast = compile_config.fast_math && !stmt->precise — when precise=false, this evaluates to the same value as the old compile_config.fast_math. Same libdevice function is selected.
  • SPIR-V: NoContraction decorations are only appended when precise=true. The binary for non-precise kernels is byte-identical.

The alg_simp.cpp and binary_op_simplify.cpp changes add && !stmt->precise to conditions that already test fast_math. Since stmt->precise defaults to false and the branch predictor will trivially learn this, the added boolean check costs ≈0 cycles per statement visit during compilation. No runtime cost at all.


3. disable_fast_math clears 7 flags individually (codegen_llvm.cpp:31–41)

The function calls 7 separate setHas* methods. The comment says setFastMathFlags(FastMathFlags{}) only ORs flags in, but LLVM also provides Instruction::copyFastMathFlags(FastMathFlags), which replaces the flag set rather than OR-ing into it. A single call with a default-constructed FastMathFlags (all flags clear) would be cleaner and marginally faster. This is compile-time-only, but it's called once per precise op during codegen, and seven separate setter calls are noisier than one.

Recommendation: Verify which LLVM version the project targets and switch to copyFastMathFlags(FastMathFlags{}) if available.
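
A condensed sketch under that assumption (`value` stands for the llvm::Value* the current helper receives; the calls are real LLVM APIs, but their availability must be checked against the targeted version):

// copyFastMathFlags replaces the whole flag set; a default-constructed
// FastMathFlags has every flag clear, so one call disables all seven.
if (auto *inst = llvm::dyn_cast<llvm::Instruction>(value)) {
  if (llvm::isa<llvm::FPMathOperator>(inst)) {
    inst->copyFastMathFlags(llvm::FastMathFlags{});
  }
}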


4. precise() expression walker uses shared_ptr copies and RTTI (expr.cpp:55–79)

The worklist is std::vector<Expr>, and Expr holds a shared_ptr<Expression>. Every stack.push_back(bin->lhs) performs an atomic ref-count increment; every stack.pop_back() decrements. The cur.cast<BinaryOpExpression>() call goes through std::dynamic_pointer_cast, which does an RTTI dynamic_cast under the hood. For a node that matches none of the three types (Binary/Unary/Ternary), all three dynamic_pointer_cast calls fire and fail.

This runs at kernel-definition time (Python → AST), not at GPU runtime, and expression trees are typically small (a handful of nodes per qd.precise() call). So the absolute cost is negligible. But if this ever needs to scale (e.g., wrapping large generated expression trees):

  • Use raw Expression* in the worklist instead of Expr to avoid atomic ref-count churn.
  • Use a type-tag / virtual dispatch pattern (e.g., expr->node_kind() enum) instead of trial-and-error dynamic_pointer_cast.

Verdict: Not a practical concern today, but worth noting for future-proofing.
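
For future reference, a sketch of that scaled-up variant (member access is illustrative; the real Expr/Expression layout may differ):

// Worklist of raw pointers: no atomic ref-count traffic. The nodes stay alive
// because the Expr tree passed to precise() owns them for the whole walk.
std::vector<Expression *> stack = {root_expr};
while (!stack.empty()) {
  Expression *cur = stack.back();
  stack.pop_back();
  if (auto *bin = dynamic_cast<BinaryOpExpression *>(cur)) {
    bin->precise = true;
    stack.push_back(bin->lhs.get());  // assumes Expr exposes its raw pointer
    stack.push_back(bin->rhs.get());
  }
  // ... UnaryOpExpression and TernaryOpExpression handled analogously
}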


5. SPIR-V maybe_no_contraction is a non-inlined call on every FP binary op

maybe_no_contraction(v, precise) is defined in spirv_ir_builder.cpp and called from macros in the same TU, so the compiler may inline it. But if it doesn't (separate compilation, no LTO), every add/sub/mul/div on FP types pays for a function call + branch even when precise=false. Making it inline in the header (it's a trivial one-liner) would guarantee the branch is folded away at the call site.

Recommendation: Move the body to the header as an inline method, or mark it __attribute__((always_inline)).
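
A sketch of the header form (the decorate(...) call stands in for whatever helper the builder already uses to emit the decoration; exact names are assumptions):

// spirv_ir_builder.h: defined in the class body, the method is implicitly
// inline, so the precise=false branch folds away at every call site.
void maybe_no_contraction(Value v, bool precise) {
  if (precise) {
    decorate(spv::OpDecorate, v, spv::DecorationNoContraction);
  }
}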


6. QD_STMT_DEF_FIELDS now includes precise — CSE/identity cost

Adding precise to the serialized fields of UnaryOpStmt and BinaryOpStmt means:

  • StmtFieldManager::equal() compares one more field per op during CSE passes.
  • The offline cache key is one byte longer per op.

This is correct and necessary (a precise a+b must not be deduplicated with a non-precise a+b), and the cost is trivially small. No concern.


7. SPIR-V decoration overhead at runtime

NoContraction is a SPIR-V metadata decoration. It does not add instructions to the shader binary — it's a directive to the downstream compiler (SPIRV-Cross, driver frontend) to avoid contraction. There is no per-invocation runtime cost for the decoration itself. The runtime cost is the absence of optimizations (no FMA contraction, no reassociation), which is the intended semantic.


Summary of actionable items

Priority | Item | Impact
High | Propagate precise onto synthesized stmts in pow exponent rewrites instead of bailing out | Avoids pow() libcall for qd.precise(x**n) — up to 5–10× faster for small integer exponents
Low | Replace 7 individual FMF clears with copyFastMathFlags({}) | Cleaner code, marginal compile-time improvement
Low | Inline maybe_no_contraction in the SPIR-V builder header | Avoids one function call per FP op during SPIR-V codegen
Negligible | Use raw pointers in precise() walker | Only matters for hypothetical very large expression trees

Could anything exhibit incorrect behavior?

Yes. There are several issues ranging from a clear codegen bug to latent correctness risks. Listed from most to least severe:

1. AMDGPU backend: precise flag is silently ignored for BinaryOpStmt (codegen bug)

The AMDGPU codegen (codegen_amdgpu.cpp:386-421) overrides visit(BinaryOpStmt *) for pow and atan2. For all other ops it delegates to TaskCodeGenLLVM::visit(stmt), which does apply disable_fast_math when stmt->precise is set. But for pow and atan2, the AMDGPU override returns without calling disable_fast_math:

  void visit(BinaryOpStmt *stmt) override {
    auto op = stmt->op_type;
    // ...
    if (op == BinaryOpType::pow) {
      // ... calls __ocml_pow_* ...
    } else if (op == BinaryOpType::atan2) {
      // ... calls __ocml_atan2_* ...
    }
    // No `if (stmt->precise) disable_fast_math(...)` here!
  }

The CUDA backend has the same structural pattern for pow/atan2 (lines 660-710), and it too omits disable_fast_math — but this matters less for pow/atan2 since they're plain function calls, not FP intrinsics with FMF. Still, the omission means the FPExt/FPTrunc instructions around them (e.g. line 675, 708 in CUDA; and the CreateSIToFP/CreateFPToSI pair in AMDGPU line 403-406) do carry fast-math flags and won't have them cleared for precise ops. In practice, FMF on cast instructions is usually benign, but the inconsistency is a latent risk.

The doc table (line 104) claims AMDGPU has precise support ("LLVM FMF cleared") but no code changes were made to codegen_amdgpu.cpp. The base-class TaskCodeGenLLVM handles normal arithmetic, but emit_extra_unary is completely overridden in AMDGPU and never checks stmt->precise — the __ocml_* functions are always called unconditionally. Unlike CUDA (which has __nv_fast_* variants), AMDGPU's __ocml_* functions are already correctly-rounded, so this is not a behavioral bug today. But if the compiler ever adds __ocml_fast_* variants, this code path has no precise guard to prevent selecting them.
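
A minimal sketch of the missing guard, mirroring what the base class does for ordinary arithmetic (assumes disable_fast_math is reachable from the subclass):

// At the end of the AMDGPU visit(BinaryOpStmt *) override:
if (stmt->precise) {
  disable_fast_math(llvm_val[stmt]);  // clear FMF on the emitted value
}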

2. SPIRV inv unary op: ir_->div() called without precise (minor, masked)

In spirv_codegen.cpp:841, the inv unary op calls ir_->div() without forwarding stmt->precise:

  } else if (stmt->op_type == UnaryOpType::inv) {
    if (is_real(dst_dt)) {
      val = ir_->div(ir_->float_immediate_number(dst_type, 1), operand_val);
    } else {
      QD_NOT_IMPLEMENTED
    }

This is masked because the post-hoc maybe_no_contraction(val, true) at line 895-896 decorates the same SPIR-V value ID afterward. So the OpFDiv does get NoContraction — but only by accident. If anyone refactors this code to return early before line 895, the decoration would be silently lost.
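
The direct fix is to forward the tag through the builder's existing optional precise parameter (the same parameter the PR added to add/sub/mul/div/mod):

// Decorate at the point of emission instead of relying on the post-hoc call:
val = ir_->div(ir_->float_immediate_number(dst_type, 1), operand_val,
               stmt->precise);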

3. Commutative swap in binary_op_simplify.cpp not gated on precise

In binary_op_simplify.cpp:83-88, the commutative operand swap (const_lhs to RHS) happens unconditionally — before the precise check at line 91:

  void visit(BinaryOpStmt *stmt) override {
    auto const_lhs = stmt->lhs->cast<ConstStmt>();
    if (const_lhs && is_commutative(stmt->op_type) && !stmt->rhs->is<ConstStmt>()) {
      stmt->lhs = stmt->rhs;
      stmt->rhs = const_lhs;
      operand_swapped = true;
    }
    if ((!fast_math || stmt->precise) && !is_integral(stmt->ret_type)) {
      return;
    }

For FP addition and multiplication, commutativity is IEEE-guaranteed (a + b == b + a bitwise), so this swap is technically correct. However, for a feature whose explicit purpose is "source-order evaluation," silently reordering operands is semantically surprising. If a future version of the precise contract promises source-order operand evaluation (as MSL/HLSL precise does), this would become a bug.
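
If the contract is ever tightened that way, one possible gate (placement illustrative) would skip the canonicalization for tagged FP ops:

// Keep the operand order the user wrote for precise FP ops; integer ops
// retain the existing const-to-RHS canonicalization.
const bool keep_source_order = stmt->precise && !is_integral(stmt->ret_type);
if (const_lhs && !keep_source_order && is_commutative(stmt->op_type) &&
    !stmt->rhs->is<ConstStmt>()) {
  stmt->lhs = stmt->rhs;
  stmt->rhs = const_lhs;
  operand_swapped = true;
}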

4. In-place mutation is not thread-safe and causes aliasing surprises

The precise() walker in expr.cpp mutates Expression nodes in-place via shared_ptr. The doc acknowledges this:

"If you alias a subexpression and then wrap one alias in qd.precise, both uses get IEEE semantics."

This is documented, but it's still surprising behavior that could cause subtle bugs. Consider:

ab = a + b
x = qd.precise(ab)  # tags the BinaryOpExpression in-place
y = ab * 2           # ab is now also precise — not what the user intended

The user sees two independent-looking variables but they share an expression node. The in-place mutation means qd.precise has a non-local effect on every alias of every subexpression in the tree. This is a correctness hazard in any code that builds expression trees with shared subexpressions.

5. TernaryOpExpression is traversed but never tagged precise

The walker descends through TernaryOpExpression to reach inner ops, but does not set a precise flag on the ternary itself. The IR-level TernaryOpStmt has no precise field. This is correct for select (which lowers to a non-FP conditional move), but if a future ternary op has FP semantics (e.g. fma as a ternary), it would silently lack the precise tag.

6. alg_simp.cpp: 1 * a -> a fold is not gated on precise

The optimize_multiplication function folds 1 * a -> a (line 342-346) unconditionally:

    if (alg_is_one(lhs) || alg_is_one(rhs)) {
      // 1 * a -> a, a * 1 -> a
      stmt->replace_usages_with(alg_is_one(lhs) ? stmt->rhs : stmt->lhs);
      modifier.erase(stmt);
      return true;
    }

This fold is IEEE-correct (1.0 * x == x for all finite x and for all NaN/Inf, under IEEE 754). So it's not wrong. But the precise tag's purpose is to prevent any rewriting of FP ops, and this fold does remove an instruction that the user explicitly marked. The analog in HLSL's precise would preserve this multiplication. This is a philosophical deviation from the stated "source-order, no simplification" contract.

Similarly, a / 1 -> a (line 395-400) is also unconditional and IEEE-correct but philosophically inconsistent.

7. constant_fold.cpp and simplify.cpp don't check precise

constant_fold.cpp folds BinaryOpStmt and UnaryOpStmt when both/all operands are constants. This can fold a precise-tagged operation into a constant, effectively removing it from the instruction stream before codegen ever sees the precise flag. For the constant-fold case specifically, the result is numerically identical (constant folding uses the host's IEEE arithmetic), so this is safe in practice. But it's another case where the precise tag is silently discarded.

simplify.cpp synthesizes BinaryOpStmt nodes (e.g. for LinearizeStmt) that inherit precise=false by default. This is fine since those synthetic ops are on integer index calculations, not user FP expressions.

8. Offline cache key serialization: TernaryOpExpression not serialized

gen_offline_cache_key.cpp serializes precise for UnaryOpExpression and BinaryOpExpression, but TernaryOpExpression has no precise field to serialize. This is currently consistent (ternary doesn't carry precise), but if the ternary ever gains a precise flag (see item 5), the cache key would need updating.

Summary of real bugs vs. latent risks

Issue | Severity | Real bug today?
AMDGPU: no disable_fast_math for pow/atan2 | Medium | No (ocml is already correctly-rounded), but the doc overclaims
SPIRV inv: div() without precise | Low | No (masked by post-hoc decoration)
Commutative swap before precise check | Low | No (IEEE-correct), but violates spirit of precise
In-place aliasing mutation | Medium | Yes — surprising side-effects on shared subexpressions
1 * a -> a fold not gated on precise | Low | No (IEEE-correct), but violates stated "no simplification" contract
Constant fold ignores precise | Low | No (numerically identical result)

Test coverage analysis

The PR adds four tests in test_precise.py plus one entry in test_api.py. Below is a per-item assessment of what is and is not exercised.

What IS covered

# | Code change | How it is tested
1 | Offline cache key (gen_offline_cache_key.cpp) | Indirectly by test_qd_precise_protects_fast_math. The naive kernel compiles first, populating the cache; if the cache key did not include the precise flag, the precise kernel would reuse the naive artifact and the final assertion (ds_err < naive_err * 1e-3) would fail. The test comment on lines 91-95 explicitly calls this out. Clever.
3 | LLVM disable_fast_math() | Exercised end-to-end by both test_qd_precise_protects_fast_math (binary ops) and test_qd_precise_unary_rounding (unary ops) on any LLVM-based backend (CPU, CUDA, AMDGPU).
6a | alg_simp: a + 0 → a / 0 + a → a precise gate | Directly tested by test_qd_precise_recurses_through_select, which uses the (-0.0) + 0.0 signed-zero observable to distinguish the folded and non-folded cases.
6b | alg_simp: a - a → 0 precise gate | Indirectly tested by test_qd_precise_protects_fast_math. The Dekker 2Sum compensation (a - aa) + (b - bb) is algebraically zero; a - a → 0 is the dominant fold that kills it. The assertion that lo ≠ 0 validates this gate.
8 | precise flag on UnaryOpStmt / BinaryOpStmt (statements.h QD_STMT_DEF_FIELDS) | Propagation from Expression → Stmt is exercised by every end-to-end test. The QD_STMT_DEF_FIELDS inclusion also affects IR hashing/comparison, which is indirectly validated by the cache-key test above.
— | TernaryOp walker recursion | test_qd_precise_recurses_through_select directly tests that qd.precise(qd.select(...)) recurses through the TernaryOp and tags inner binary ops.
— | ValueError on Quadrants classes | test_qd_precise_rejects_quadrants_classes covers Vector and Matrix.
— | Unary precise (sin/cos/log/sqrt) | test_qd_precise_unary_rounding verifies ≤2 ULP against a correctly-rounded f32 reference.
— | Public API surface | test_api.py verifies precise is in qd.__all__.

What is NOT covered

# | Code change | Gap
2 | CUDA codegen use_fast variable (fast libdevice selection for sin/cos/log) | Only exercised when the test runner happens to use a CUDA backend. There is no @pytest.mark.parametrize or arch filter ensuring CUDA-specific coverage. On CPU-only CI, this code path is dead.
4 | SPIR-V NoContraction decoration (spirv_codegen.cpp) | Same issue: only exercised when the active backend is Vulkan or Metal. No SPIR-V-specific test. No unit test that inspects the generated SPIR-V IR for the decoration.
5 | SPIR-V IR builder precise parameter on add/sub/mul/div/mod | Same as #4 — backend-dependent, no unit-level test of the IR builder itself.
6c | alg_simp: 0 * a → 0 precise gate | Not tested. A test could do qd.precise(x * zero) where x is NaN; under IEEE NaN * 0 = NaN, but the fast-math fold would yield 0.
6d | alg_simp: a / a → 1 precise gate | Not tested. A test could do qd.precise(x / x) where x = 0.0; under IEEE 0/0 = NaN, fast-math folds to 1.
6e | alg_simp: a / const → a * (1/const) precise gate | Not tested. Could be observed with a carefully chosen constant where a * (1/c) and a / c differ at the ULP level.
6f | alg_simp: 2 * a → a + a precise propagation | The transformation itself is exercised (it fires for both precise and non-precise), but the test never verifies that the synthesized add carries sum->precise = stmt->precise. A regression here would silently strip the tag on the generated add.
6g | alg_simp: comparison a == a → 1 precise gate | Not tested. With NaN inputs, NaN == NaN should be false under IEEE but the fast-math fold makes it true.
7 | binary_op_simplify.cpp changes | Neither the try_rearranging_const_rhs precise bail nor the (!fast_math || stmt->precise) early return is directly tested. The 2Sum test may exercise the early return indirectly (the precise ops have const-zero operands that would match rearrangement patterns), but there is no targeted assertion.
9 | fast_math=False + qd.precise (redundant but should work) | Not tested. All kernel tests use fast_math=True. A simple test with fast_math=False plus qd.precise wrapping would verify the redundant path doesn't crash or regress.
10 | Different data types (f64, f16) | Every test uses qd.f32 exclusively. f64 is important because the CUDA codegen for sin/cos/log on f64 takes a different branch (__nv_sin vs __nv_sinf). f16 is relevant because BinaryOpStmt::visit has special f16 widening/narrowing around the codegen that interacts with the disable_fast_math call placement.
11 | Nested qd.precise calls | qd.precise(qd.precise(expr)) is never tested. The walker is idempotent so this should be harmless, but it's an untested edge case.
12 | qd.precise on integer expressions | Not tested. The code paths in alg_simp.cpp carefully check is_integral(...) to preserve integer folds even when precise is set. A test wrapping integer arithmetic in qd.precise would validate that integer simplifications are not accidentally suppressed.
13 | pow precise-bail | Deliberately omitted with a well-reasoned comment (lines 223-227 of the test file) explaining that the rewrite targets are IEEE-equivalent and no runtime observable difference exists today. Acceptable as-is, but the guard is purely future-proofing and could silently break without notice.
14 | Walker stopping at loads, constants, func calls | Partially covered: the 2Sum test uses loads and qd.func calls as boundaries, so if the walker incorrectly descended into them it would hit a crash or type error. But there is no explicit test that e.g. a qd.func call site's inner body remains non-precise when only the call site is wrapped. The doc says "semantics inside a qd.func body are governed by that body's own ops" — this contract is not tested. Atomic ops as a walker boundary are also untested.

Summary

The tests are well-designed for the happy path: the Dekker 2Sum test is the right canonical workload and the signed-zero select test is clever. However, coverage is narrow along three axes:

  1. Breadth of simplification rules: Only 2 of ~7 modified alg_simp rules are observably tested (a + 0 and a - a). The rest (0 * a, a / a, a / const, 2*a tag propagation, comparisons) have no targeted tests. Several of these have clean IEEE-observable differences (NaN propagation for 0 * NaN, 0/0, NaN == NaN) that would make good test cases.

  2. Backend-specific codegen: CUDA libdevice selection and SPIR-V NoContraction decoration are only tested when those backends happen to be active. On CPU-only CI, those code paths get zero coverage. A SPIR-V IR unit test that checks for the decoration would be especially valuable since NoContraction is the only mechanism protecting compensated arithmetic on Metal/Vulkan.

  3. Type coverage: f64 and f16 are entirely untested, yet both have distinct codegen branches that interact with the precise flag.

The binary_op_simplify.cpp changes (both the try_rearranging_const_rhs precise bail and the early-return gate) have no targeted tests. These are the most likely to regress silently since the bail conditions are subtle (they depend on the inner LHS being a precise BinaryOpStmt with a const RHS).

Test factorization analysis

The test file has four tests in ~227 lines. The structure is reasonable for a new feature, but there are concrete opportunities to reduce duplication without sacrificing clarity.

1. Unary rounding test: parametrize over functions

test_qd_precise_unary_rounding tests sin, cos, log, and sqrt in a single kernel, computing all four in lockstep and comparing against a stacked numpy reference. This "batch" design means a failure in any one function fails the whole test with a single max-ULP number, making triage harder.

Suggestion: Use @pytest.mark.parametrize to run each function independently. The repo already does this in test_scalar_op.py (lines 51–65) where binary_func_table / unary_func_table are parametrized. Here each parametrized case resolves the op with getattr at definition time, so the kernel specializes per function:

@pytest.mark.parametrize("op_name", ["sin", "cos", "log", "sqrt"])
@test_utils.test(default_fp=qd.f32, fast_math=True)
def test_qd_precise_unary_rounding(op_name):
    qd_op = getattr(qd, op_name)
    np_op = getattr(np, op_name)

    @qd.kernel
    def k(x: qd.types.ndarray(qd.f32, ndim=1),
          out: qd.types.ndarray(qd.f32, ndim=1)):
        for i in range(x.shape[0]):
            out[i] = qd.precise(qd_op(x[i]))

    xs = np.array([0.5, 1.5, 2.5, 4.0, 7.0, 10.0, 25.0, 50.0], dtype=np.float32)
    in_arr = qd.ndarray(dtype=qd.f32, shape=(len(xs),))
    in_arr.from_numpy(xs)
    out = qd.ndarray(dtype=qd.f32, shape=(len(xs),))
    k(in_arr, out)
    res = out.to_numpy()

    ref = np_op(xs.astype(np.float64)).astype(np.float32)
    ulp = np.spacing(np.maximum(np.abs(ref), np.float32(1.0)))
    max_ulp = float(np.max(np.abs(res - ref) / ulp))
    assert max_ulp <= 2.0, f"qd.precise(qd.{op_name}(x)) off by {max_ulp:.2f} ULP"

This gives per-function pass/fail reporting and is trivially extensible (add "exp", "tan", etc.).

2. Unify two_sum_naive/two_sum_precise and fast_two_sum_naive/fast_two_sum_precise with qd.static()

The main test defines four @qd.func helpers — two_sum_naive, two_sum_precise, fast_two_sum_naive, fast_two_sum_precise — that are structurally identical except for whether each op is wrapped in qd.precise(). The repo already uses if qd.static(...) inside kernels/funcs for exactly this kind of compile-time branching (see test_static.py, test_cache.py, bls_test_template.py).

Suggestion: Collapse to two funcs using a use_precise: qd.template() parameter:

@qd.func
def two_sum(a, b, use_precise: qd.template()):
    if qd.static(use_precise):
        s = qd.precise(a + b)
        bb = qd.precise(s - a)
        aa = qd.precise(s - bb)
        e = qd.precise((a - aa) + (b - bb))
    else:
        s = a + b
        bb = s - a
        aa = s - bb
        e = (a - aa) + (b - bb)
    return s, e

@qd.func
def fast_two_sum(a, b, use_precise: qd.template()):
    if qd.static(use_precise):
        s = qd.precise(a + b)
        e = qd.precise(b - (s - a))
    else:
        s = a + b
        e = b - (s - a)
    return s, e

This removes two of the four helper functions and makes the naive-vs-precise symmetry explicit instead of implicit. The two kernel functions (df_accum_naive, df_accum_precise) would similarly collapse into one kernel that takes use_precise: qd.template().

However, there is a counter-argument: the qd.static version still duplicates the expression bodies inside the if/else branches, so the line savings are modest. The current form arguably makes the "what does precise protect?" story easier to read side-by-side. This is a judgment call — I'd suggest unifying only if the test file grows more variants (e.g. two_sum_f64, different accumulation orders), where the current copy-paste approach would scale badly.

3. The two kernels df_accum_naive/df_accum_precise could be unified

Same pattern as above. These are identical except for calling the _naive vs _precise helpers and wrapping e + lo in qd.precise. A single kernel parametrized by use_precise: qd.template() would cut ~20 lines:

@qd.kernel
def df_accum(in_arr: qd.types.ndarray(qd.f32, ndim=1),
             out: qd.types.ndarray(qd.f32, ndim=1),
             use_precise: qd.template()):
    for _ in range(1):
        hi = qd.f32(1.0)
        lo = qd.f32(0.0)
        for i in range(N):
            s, e = two_sum(hi, in_arr[i], use_precise)
            if qd.static(use_precise):
                e = qd.precise(e + lo)
            else:
                e = e + lo
            hi, lo = fast_two_sum(s, e, use_precise)
        out[0] = hi
        out[1] = lo

df_accum(in_arr, out_naive, False)
df_accum(in_arr, out_precise, True)

4. Extract shared ndarray setup

The pattern of creating input/output ndarrays, calling a kernel, and reading back numpy results appears in every test. A small helper would reduce boilerplate:

def _make_f32(shape, fill=None, data=None):
    arr = qd.ndarray(dtype=qd.f32, shape=shape)
    if data is not None:
        arr.from_numpy(data)
    elif fill is not None:
        arr.from_numpy(np.full(shape, fill, dtype=np.float32))
    return arr

This is minor (saves ~3 lines per test), and every other test file in the repo repeats the same boilerplate, so keeping it matches project style while a shared helper would be a local divergence. Not a strong recommendation.

5. Parametrize naive vs. precise comparison in the main test? Not recommended.

One might be tempted to @pytest.mark.parametrize("use_precise", [True, False]) on the main 2Sum test and assert different things per variant. But the test's point is to run both in the same qd.init session so the offline-cache-key distinction is validated (running naive first populates the cache; running precise second must not reuse that artifact). Splitting into two parametrized invocations would lose that coverage because @test_utils.test calls qd.init separately per invocation. Keep this as a single test.

6. Summary of concrete recommendations

Change | Lines saved (approx.) | Readability impact | Recommended?
Parametrize unary test over sin/cos/log/sqrt | ~15 | Better: per-op pass/fail | Yes
Unify two_sum_naive/two_sum_precise via qd.static | ~15 | Neutral to slightly worse | Soft yes — do it if more variants are coming
Unify df_accum_naive/df_accum_precise via qd.static | ~20 | Neutral | Soft yes, same caveat
Extract ndarray helper | ~10 | Marginal | No — not worth a project-wide pattern break
Parametrize naive-vs-precise as two test cases | 0 | Worse: breaks cache-key coverage | No

The strongest win is (1): parametrize the unary test. It costs nothing, improves diagnostics, and matches existing repo conventions (test_scalar_op.py). Items (2)–(3) are nice-to-haves if the file grows; the current duplication is tolerable at this size.

@hughperkins
Collaborator

  • Let's address the missing tests identified by AI above
  • Let's address the following AI-identified issue:
    "3. Near-identical bool precise{false} field + comment across four IR classes
    The bool precise{false}; field is added to four classes, each with a near-identical two-line doc comment ("Mirrors MSL/HLSL precise"):

UnaryOpExpression (frontend_ir.h:375–377)
BinaryOpExpression (frontend_ir.h:401–403)
UnaryOpStmt (statements.h:158–160)
BinaryOpStmt (statements.h:254–256)
The comments are subtly inconsistent (the UnaryOpStmt version says "no contraction, no approximate implementations" while BinaryOpStmt says "no reassociation, no contraction, no algebraic folds"). This isn't a code-level refactoring opportunity per se (C++ doesn't have a great mechanism for sharing a field across unrelated class hierarchies), but the comment wording should be unified. A single canonical comment could be defined next to the precise() function in expr.h and the field-level comments shortened to a cross-reference."

Comment thread quadrants/codegen/llvm/codegen_llvm.cpp
Comment thread quadrants/transforms/alg_simp.cpp
Comment thread quadrants/codegen/llvm/codegen_llvm.cpp
Comment thread quadrants/ir/expr.cpp Outdated
@hughperkins
Collaborator

hughperkins commented Apr 13, 2026

Let's also address:


  1. precise inserted out of order in __all__
    In python/quadrants/lang/ops.py, "precise" is inserted between "cast" and "ceil" (line 1564). The surrounding entries in that section of __all__ are roughly alphabetical (bit_cast, bit_shr, cast, ceil, cos, exp, ...), so precise should appear later, after pow. The test_api.py placement is correct (between pow and profiler), making the inconsistency more visible.

  2. Inconsistent QD_NOT_IMPLEMENTED reformatting in quadrants/codegen/spirv/spirv_codegen.cpp
    The PR reformats the one-liner in the unary visit() to a proper multi-line block:

else {
  QD_NOT_IMPLEMENTED
}
But the structurally identical pattern in the binary visit() is left as a single line:

else {QD_NOT_IMPLEMENTED} ir_->register_value(bin_name, bin_value);
This creates an internal inconsistency in the same file. If the unary path was worth reformatting (which it was — AllowShortBlocksOnASingleLine: false in .clang-format), the binary path should get the same treatment.


  3. Stmt::make<> vs Stmt::make_typed<> inconsistency within the same file
    In alg_simp.cpp, the PR changes one Stmt::make to Stmt::make_typed for the 2 * a → a + a rewrite (line 377) so that it can set sum->precise. However, every other synthesized BinaryOpStmt in the same file still uses Stmt::make (lines 224, 233, 270, 272, 362, 416, 429). The mix of make<> and make_typed<> for the same type within a single file is a style inconsistency. Either is fine, but the PR should pick the one already dominant in the file (make<>), or convert them all to make_typed<>.

  4. Synthesized stmts in alg_simp.cpp don't propagate precise — inconsistent treatment
    The 2 * a → a + a rewrite correctly copies stmt->precise to the new sum stmt. But several other rewrites in the same file create new BinaryOpStmt / UnaryOpStmt nodes that inherit precise=false by default:

a ** 0.5 → sqrt(a) (line 186): creates UnaryOpStmt(sqrt) without copying precise.
a ** n → exponentiation by squaring (lines 224, 233): creates mul stmts without precise.
a ** -n → 1 / a**n (lines 268–272): creates neg, pow, div stmts without precise.
a / const → a * (1/const) (line 416): creates mul without precise.
The PR handles this for pow by bailing out entirely when stmt->precise is set (the if (stmt->precise) { ... } guard at line 470), which is correct for pow. But the a / const → a * (1/const) rewrite is guarded only by fast_math && !stmt->precise at a higher level (line 408), and the sqrt and exponentiation-by-squaring rewrites are not behind a precise guard at all. If a user wrote qd.precise(x ** 0.5), the exponent_half_optimize path would fire and produce an untagged sqrt, silently dropping the precise intent. This is a functional gap, not just a style issue.


  5. UnaryOpStmt::same_operation() does not account for precise
    same_operation() (in statements.cpp, line 25) only compares op_type and cast_type. Two UnaryOpStmts that differ only in precise will be reported as the "same operation." This could cause CSE or other passes that use same_operation() to merge a precise op with a non-precise one. The QD_STMT_DEF_FIELDS macro does include precise, so the field_manager.equal() path in same_statements.cpp is correct, but any call site relying on same_operation() alone could be a problem.

  6. TernaryOpExpression and TernaryOpStmt lack a precise field
    The walker in expr.cpp descends through TernaryOpExpression and tags its children, but never sets a precise flag on the TernaryOpExpression itself. TernaryOpStmt likewise has no precise field. This is internally consistent (select/ifte are not FP arithmetic), but it means the offline cache key generator (gen_offline_cache_key.cpp) does not emit precise for ternary nodes — which is fine today since they don't carry the flag, but creates an asymmetry: if a ternary's children are tagged precise, the ternary's cache key stays the same, and correctness relies entirely on the children's cache keys distinguishing themselves. This is correct but worth documenting as intentional.

  7. Python-side: no @quadrants_scope decorator on precise()
    bit_cast() and cast() in ops.py do not use @quadrants_scope either, so precise() is consistent with those immediate neighbors. However, many other ops in the file do use the decorator. Since precise() only makes sense inside a kernel/func body (it calls _qd_core.precise which requires an active IR context), the lack of the decorator means the error message for out-of-scope usage will be a cryptic C++ exception rather than a clean Python-level error. This matches cast/bit_cast behavior, so it is not a new inconsistency, but it is a missed opportunity to be consistent with the majority of the file.

  8. Docstring in ops.py says "every binary FP op" but code also tags unary ops
    The docstring for precise() says "Every binary FP op inside obj is evaluated in source order…" and the Returns section says "with every reachable binary op tagged as precise". But the C++ implementation (and the longer body of the docstring) also tags UnaryOpExpression nodes. The summary line and Returns section are inaccurate — they should say "binary and unary FP ops."

  9. Empty if body for pow precise bail — alg_simp.cpp
    The pow-handling block (around line 470) uses an empty if body as a no-op guard:

if (stmt->precise) {
  // Preserve the user's pow() call verbatim.
} else if (exponent_one_optimize(stmt)) {
An empty if body that exists only to skip the else chain is unconventional and easy to misread. A simple negated guard is shorter and clearer:

if (!stmt->precise) {
  if (exponent_one_optimize(stmt)) {
    // a ** 1 -> a
  } else if (exponent_zero_optimize(stmt)) {
    // ...
  }
}


For this one, I'm not necessarily requesting that it be implemented, but I would at least like you to acknowledge and respond to it please:

  1. pow precise bail-out leaves runtime performance on the table (alg_simp.cpp:469–472)
    This is the biggest issue. The current code unconditionally bails out of all exponent rewrites when precise is set:

if (stmt->precise) {
  // Preserve the user's pow() call verbatim.
} else if (exponent_one_optimize(stmt)) {
The test file's own comment acknowledges the rewrites (a**1 → a, a**0.5 → sqrt(a), a**n → mul chain) are IEEE-equivalent to the original pow(). The stated reason for bailing is that the synthesized BinaryOpStmt/UnaryOpStmt nodes inherit precise=false, stripping the tag. But that's easily fixable: propagate stmt->precise onto each synthesized node (the same way the 2*a → a+a rewrite already does at line 382). Then the user gets the fast mul-chain and keeps IEEE-strict semantics on every op.

As written, qd.precise(x ** 4) emits a pow(x, 4) libcall/intrinsic at runtime instead of three multiplies — that is an order-of-magnitude slower on every backend (GPU pow is ~50 cycles vs. ~12 cycles for three FMULs). Users of compensated arithmetic do use small-integer powers, so this is likely to bite.

Recommendation: Propagate precise onto synthesized stmts in exponent_n_optimize, exponent_half_optimize, exponent_negative_optimize (as is already done for the alg_is_two → add rewrite), and remove the blanket bail-out.


Not sure if the following is still true in LLVM 22. Also not sure if this will affect code correctness. Would at least like you to acknowledge and respond to it please:

  1. disable_fast_math clears 7 flags individually (codegen_llvm.cpp:31–41)
    The function calls 7 separate setHas* methods. The comment says setFastMathFlags(FastMathFlags{}) only ORs flags in, but LLVM also provides Instruction::copyFastMathFlags(FastMathFlags), which replaces the flag set rather than OR-ing into it. A single call with a default-constructed FastMathFlags (all flags clear) would be cleaner and marginally faster. This is compile-time-only, but it's called once per precise op during codegen, and seven separate setter calls are noisier than one.

Recommendation: Verify which LLVM version the project targets and switch to copyFastMathFlags(FastMathFlags{}) if available.


Not requesting the following, but seems like the kind of thing you might find interesting?

  1. precise() expression walker uses shared_ptr copies and RTTI (expr.cpp:55–79)
    The worklist is std::vector<Expr>, and Expr holds a shared_ptr<Expression>. Every stack.push_back(bin->lhs) performs an atomic ref-count increment; every stack.pop_back() decrements. The cur.cast<BinaryOpExpression>() call goes through std::dynamic_pointer_cast, which does an RTTI dynamic_cast under the hood. For a node that matches none of the three types (Binary/Unary/Ternary), all three dynamic_pointer_cast calls fire and fail.

This runs at kernel-definition time (Python → AST), not at GPU runtime, and expression trees are typically small (a handful of nodes per qd.precise() call). So the absolute cost is negligible. But if this ever needs to scale (e.g., wrapping large generated expression trees):

Use raw Expression* in the worklist instead of Expr to avoid atomic ref-count churn.
Use a type-tag / virtual dispatch pattern (e.g., expr->node_kind() enum) instead of trial-and-error dynamic_pointer_cast.
Verdict: Not a practical concern today, but worth noting for future-proofing.


This sounds worth doing:

  1. SPIR-V maybe_no_contraction is a non-inlined call on every FP binary op
    maybe_no_contraction(v, precise) is defined in spirv_ir_builder.cpp and called from macros in the same TU, so the compiler may inline it. But if it doesn't (separate compilation, no LTO), every add/sub/mul/div on FP types pays for a function call + branch even when precise=false. Making it inline in the header (it's a trivial one-liner) would guarantee the branch is folded away at the call site.

Recommendation: Move the body to the header as an inline method, or mark it __attribute__((always_inline)).


This seems worth checking:

  1. AMDGPU backend: precise flag is silently ignored for BinaryOpStmt (codegen bug)
    The AMDGPU codegen (codegen_amdgpu.cpp:386-421) overrides visit(BinaryOpStmt *) for pow and atan2. For all other ops it delegates to TaskCodeGenLLVM::visit(stmt), which does apply disable_fast_math when stmt->precise is set. But for pow and atan2, the AMDGPU override returns without calling disable_fast_math:

void visit(BinaryOpStmt *stmt) override {
  auto op = stmt->op_type;
  // ...
  if (op == BinaryOpType::pow) {
    // ... calls __ocml_pow_* ...
  } else if (op == BinaryOpType::atan2) {
    // ... calls __ocml_atan2_* ...
  }
  // No `if (stmt->precise) disable_fast_math(...)` here!
}
The CUDA backend has the same structural pattern for pow/atan2 (lines 660-710), and it too omits disable_fast_math — but this matters less for pow/atan2 since they're plain function calls, not FP intrinsics with FMF. Still, the omission means the FPExt/FPTrunc instructions around them (e.g. line 675, 708 in CUDA; and the CreateSIToFP/CreateFPToSI pair in AMDGPU line 403-406) do carry fast-math flags and won't have them cleared for precise ops. In practice, FMF on cast instructions is usually benign, but the inconsistency is a latent risk.

The doc table (line 104) claims AMDGPU has precise support ("LLVM FMF cleared") but no code changes were made to codegen_amdgpu.cpp. The base-class TaskCodeGenLLVM handles normal arithmetic, but emit_extra_unary is completely overridden in AMDGPU and never checks stmt->precise — the __ocml_* functions are always called unconditionally. Unlike CUDA (which has __nv_fast_* variants), AMDGPU's __ocml_* functions are already correctly-rounded, so this is not a behavioral bug today. But if the compiler ever adds __ocml_fast_* variants, this code path has no precise guard to prevent selecting them.


  1. SPIRV inv unary op: ir_->div() called without precise (minor, masked)
    In spirv_codegen.cpp:841, the inv unary op calls ir_->div() without forwarding stmt->precise:

} else if (stmt->op_type == UnaryOpType::inv) {
  if (is_real(dst_dt)) {
    val = ir_->div(ir_->float_immediate_number(dst_type, 1), operand_val);
  } else {
    QD_NOT_IMPLEMENTED
  }
This is masked because the post-hoc maybe_no_contraction(val, true) at line 895-896 decorates the same SPIR-V value ID afterward. So the OpFDiv does get NoContraction — but only by accident. If anyone refactors this code to return early before line 895, the decoration would be silently lost.


  1. Commutative swap in binary_op_simplify.cpp not gated on precise
    In binary_op_simplify.cpp:83-88, the commutative operand swap (const_lhs to RHS) happens unconditionally — before the precise check at line 91:

void visit(BinaryOpStmt *stmt) override {
  auto const_lhs = stmt->lhs->cast<ConstStmt>();
  if (const_lhs && is_commutative(stmt->op_type) && !stmt->rhs->is<ConstStmt>()) {
    stmt->lhs = stmt->rhs;
    stmt->rhs = const_lhs;
    operand_swapped = true;
  }
  if ((!fast_math || stmt->precise) && !is_integral(stmt->ret_type)) {
    return;
  }
For FP addition and multiplication, commutativity is IEEE-guaranteed (a + b == b + a bitwise), so this swap is technically correct. However, for a feature whose explicit purpose is "source-order evaluation," silently reordering operands is semantically surprising. If a future version of the precise contract promises source-order operand evaluation (as MSL/HLSL precise does), this would become a bug.


This seems important to me, though it might need revisiting the architecture:

  1. In-place mutation is not thread-safe and causes aliasing surprises
    The precise() walker in expr.cpp mutates Expression nodes in-place via shared_ptr. The doc acknowledges this:

"If you alias a subexpression and then wrap one alias in qd.precise, both uses get IEEE semantics."

This is documented, but it's still surprising behavior that could cause subtle bugs. Consider:

ab = a + b
x = qd.precise(ab)  # tags the BinaryOpExpression in-place
y = ab * 2          # ab is now also precise — not what the user intended

The user sees two independent-looking variables but they share an expression node. The in-place mutation means qd.precise has a non-local effect on every alias of every subexpression in the tree. This is a correctness hazard in any code that builds expression trees with shared subexpressions.


  1. Unary rounding test: parametrize over functions
    test_qd_precise_unary_rounding tests sin, cos, log, and sqrt in a single kernel, computing all four in lockstep and comparing against a stacked numpy reference. This "batch" design means a failure in any one function fails the whole test with a single max-ULP number, making triage harder.

Suggestion: Use @pytest.mark.parametrize to run each function independently. The repo already does this in test_scalar_op.py (lines 51–65) where binary_func_table / unary_func_table are parametrized. Here the kernel would take the op as a qd.static(...) parameter:


  1. Unify two_sum_naive/two_sum_precise and fast_two_sum_naive/fast_two_sum_precise with qd.static()
    The main test defines four @qd.func helpers — two_sum_naive, two_sum_precise, fast_two_sum_naive, fast_two_sum_precise — that are structurally identical except for whether each op is wrapped in qd.precise(). The repo already uses if qd.static(...) inside kernels/funcs for exactly this kind of compile-time branching (see test_static.py, test_cache.py, bls_test_template.py).

Suggestion: Collapse to two funcs using a use_precise: qd.template() parameter:


  1. The two kernels df_accum_naive/df_accum_precise could be unified
    Same pattern as above. These are identical except for calling the _naive vs _precise helpers and wrapping e + lo in qd.precise. A single kernel parametrized by use_precise: qd.template() would cut ~20 lines:

@hughperkins
Collaborator

Opus assessment on what has changed relative to my requests:

ADDRESSED (done in the latest update):

  1. Missing tests — Several new tests added: test_qd_precise_recurses_through_bit_cast (walker through UnaryOp), test_qd_precise_stops_at_qd_func_call (walker boundary),
    test_qd_precise_tag_travels_with_aliased_expr (aliasing behavior), test_qd_precise_idempotent_when_fast_math_off (fast_math=False interaction).
  2. Near-identical bool precise{false} comments across 4 IR classes — All four comments now read // Set by 'qd.precise(...)'; see quadrants::lang::precise() in ir/expr.h for
    the canonical contract. The canonical comment lives in expr.h. Done.
  3. precise inserted out of order in __all__ — Moved from between cast/ceil to after pow. Done.
  4. Inconsistent QD_NOT_IMPLEMENTED reformatting in spirv_codegen.cpp — Binary visitor's one-liner reformatted to multi-line. Done.
  5. Synthesized stmts in alg_simp.cpp don't propagate precise — Now propagated for a ** 0.5 → sqrt(a), a ** n → mul chain, a ** -n → neg/pow/div, 2 * a → a + a, and
    cast_to_result_type. Also propagated in demote_operations.cpp for pow → mul chain and floordiv → div+floor. Done.
  6. UnaryOpStmt::same_operation() does not account for precise — Now checks precise != o->precise and returns false. Done.
  7. TernaryOpExpression lacks precise field — document as intentional — Comment added in expr.cpp explaining why ternary nodes don't carry the flag and why that's correct.
    Done.
  8. Docstring says "every binary FP op" but code also tags unary ops — Updated to say "Every binary and unary FP op" in both the summary and Returns section. Done.
  9. Empty if body for pow precise bail — Replaced with the recommended pattern: the if (stmt->precise) {} bail is removed since precise is now propagated onto synthesized
    stmts. The exponent_* helpers run unconditionally. Done.
  10. pow precise bail-out leaves runtime performance on the table — Addressed: precise is now propagated onto all synthesized stmts (sqrt, mul chain, neg/pow/div), so
    qd.precise(x ** 4) gets the fast 3-multiply rewrite with IEEE-strict tags. The blanket bail-out is gone. Done.
  11. SPIR-V maybe_no_contraction not inlined — Moved from .cpp to inline in the header. Done.
  12. AMDGPU backend: precise flag silently ignored for BinaryOpStmt — disable_fast_math(llvm_val[stmt]) now called at the end of the AMDGPU visit(BinaryOpStmt*) override.
    disable_fast_math promoted from file-local to static member of TaskCodeGenLLVM so subclasses can use it. Done.
  13. SPIRV inv unary op: ir_->div() called without forwarding precise — Now passes stmt->precise explicitly. Done.
  14. CUDA backend visit(BinaryOpStmt*): disable_fast_math before FPTrunc for f16 — Added. Same pattern applied in codegen_llvm.cpp base class for both unary and binary f16
    paths, and the rsqrt intermediate in the inv lowering. Done.

────────────────────────────────────────

PARTIALLY ADDRESSED:
15. Stmt::make<> vs Stmt::make_typed<> inconsistency — The PR converted the stmts that need precise propagation to make_typed<>, but left three others as make<> (lines 381,
435, 448 — the bit_shl, mul in optimize_division, and bit_sar paths). These are all integer-domain operations where precise has no effect, so it's functionally harmless, but
the stylistic mix remains in the same file. Still open as a style nit.

────────────────────────────────────────

NOT ADDRESSED (acknowledged or deferred):
16. disable_fast_math clears 7 flags individually — The comment in the header still says setFastMathFlags(FastMathFlags{}) only ORs in flags on this LLVM version. Not
changed. You asked for acknowledgment — it seems the reviewer kept the status quo, judging it correct for the targeted LLVM version. Open — worth verifying against the
actual LLVM version used.
17. Expression walker uses shared_ptr copies and RTTI — Acknowledged as "not a practical concern today" in the original review. Not addressed (and not requested).
18. No @quadrants_scope decorator on precise() — Not addressed. Consistent with cast()/bit_cast() neighbors but still a missed opportunity. Open.
19. In-place mutation aliasing hazard — The aliasing behavior is now tested (test_qd_precise_tag_travels_with_aliased_expr), which is good, but the fundamental design (in-place
mutation via shared_ptr) is unchanged. Open as a design question.
20. Commutative swap in binary_op_simplify.cpp not gated on precise — Not addressed. Still swaps operands before the precise check. IEEE-correct but violates "source-order"
spirit. Open.
21. Test factorization — parametrize unary rounding test — Not addressed. Still tests all four ops in one kernel. Open.
22. Test factorization — unify naive/precise helpers with qd.static() — Not addressed. Open.
23. a / const → a * (1/const) rewrite in optimize_division (line 435) — This specific rewrite is already guarded by fast_math && !stmt->precise at line 427, so it won't fire
for precise ops. The synthesized mul at line 435 doesn't propagate precise, but it never needs to because the path is unreachable when stmt->precise is true. Not a bug, but
the make<> → make_typed<> style nit remains.

@hughperkins
Collaborator

Looks like a bunch of things have not been addressed yet?

Of these, I think the most critical is the current design decision to modify in-place, giving unintuitive aliasing effects. I would also like the tests to be factorized to use parametrize.


@claude claude Bot left a comment


LGTM — the inline nit about SPIR-V NoContraction on atan2/pow is the only remaining item; as the PR itself documents, NoContraction scopes to arithmetic instructions per spec, so its absence on OpExtInst transcendentals is a known best-effort limitation rather than a regression.

Extended reasoning...

Overview

PR adds qd.precise(expr) across 23 files: Python API, frontend IR, backend IR (both Expression and Stmt layers), three codegen backends (LLVM/CUDA/AMDGPU/SPIR-V), two transform passes (alg_simp, binary_op_simplify, demote_operations), offline cache key, and tests + docs.

Security risks

None. This is a compile-time IR transformation that adds a boolean flag and gates fast-math rewrites on it. No auth, crypto, permissions, or external I/O involved.

Level of scrutiny

This PR warranted — and received — high scrutiny. It touches critical FP codegen paths on every backend. Through multiple review rounds, nine distinct correctness bugs were identified and fixed (unary pipeline, 2×a strength reduction, a-0 gate, pow/exponent synthesized-stmt tagging, demote_operations pow and ffloor, f16 rsqrt intermediate, f16 atan2/pow, AMDGPU override, SPIR-V inv). All were addressed in the current diff. The inline nit (SPIR-V NoContraction not emitted for atan2/pow via FLOAT_BINARY_OP_TO_SPIRV_FLOAT_FUNC) has zero functional impact — the PR description and code comments explicitly acknowledge that NoContraction scopes to arithmetic instructions and is best-effort on OpExtInst transcendentals.

Other factors

  • Seven distinct test functions covering the primary use case, unary rounding, walker boundaries (select, bit_cast, qd.func call sites, aliasing), idempotency under fast_math=False, and the offline-cache-key distinction.
  • User-facing documentation (precise.md) linked from the user guide, with an honest backend coverage table.
  • The PR author took full personal responsibility for the code at 16:26 and has been extremely responsive throughout the review cycle.

Comment thread quadrants/codegen/spirv/spirv_codegen.cpp
Comment thread tests/python/test_precise.py Outdated
Comment thread quadrants/codegen/cuda/codegen_cuda.cpp
duburcqa added 20 commits April 14, 2026 23:49
@duburcqa duburcqa force-pushed the duburcqa/qd_precise branch from 7fc4b56 to cafb630 Compare April 14, 2026 21:49
@duburcqa
Contributor Author

This feature is fully ready but no longer actively required in Genesis. Closing for now, as requested by @hughperkins, the current official Quadrants maintainer, in accordance with the current merging policy.

@duburcqa duburcqa closed this Apr 14, 2026
@hughperkins hughperkins reopened this Apr 15, 2026
@hughperkins hughperkins changed the base branch from main to experimental April 15, 2026 13:33
@hughperkins
Collaborator

Please could you consider reducing the 'point of contact' between the experimental code in this branch and the commonly used code paths?


@claude claude Bot left a comment


Two correctness bugs found; deferring for author fix before merge.

Extended reasoning...

Overview

PR #476 adds qd.precise(...) for per-op IEEE-strict FP across 26 files, touching IR, codegen (LLVM, CUDA, AMDGPU, SPIR-V), algebraic simplification passes, and tests. The scope is non-trivial.

Bugs Found

Two correctness issues were identified by the automated review system and are flagged as inline comments on the relevant lines:

  1. exponent_half_optimize violates the precise contract for a = -0.0 (quadrants/transforms/alg_simp.cpp): The rewrite a ** 0.5 → sqrt(a) is applied unconditionally even when stmt->precise is set. While the PR correctly propagates result->precise = stmt->precise to clear FMF on the synthesized sqrt, it does not prevent the semantic change: per IEEE 754 §9.2, pow(-0.0, 0.5) = +0.0, but per §5.4.1, sqrt(-0.0) = -0.0. Code calling qd.precise(x ** 0.5) with x = -0.0 will silently get the wrong sign bit. Fix: add if (stmt->precise) return false; at the top of exponent_half_optimize().

  2. Three signed-zero tests missing arch restriction (tests/python/test_precise.py): test_qd_precise_recurses_through_select, test_qd_precise_recurses_through_bit_cast, and test_qd_precise_stops_at_qd_func_call assert precise_bits == 0x00000000 (i.e. (-0.0f) + 0.0f = +0.0f), which requires IEEE signed-zero preservation. SPIR-V's NoContraction decoration only prevents FMA contraction; without emitting the SignedZeroInfNanPreserve execution mode, conformant Vulkan/Metal drivers may return -0.0, causing spurious CI failures. The fix is to add arch=[qd.cpu, qd.cuda, qd.amdgpu] to all three test decorators, matching the pattern already used elsewhere in the file.

Security Risks

None. This is a floating-point precision primitive; no auth, crypto, or untrusted-input handling is involved.

Level of Scrutiny

The changes are substantial (26 files, new IR field, codegen changes across all backends) but well-structured. The two bugs are confined to a single optimization gate and three test decorators — both straightforward to fix. The rest of the PR looks mechanically correct given the design. Human review is still warranted for the overall approach, but these two issues should be addressed first.

Other Factors

No prior review activity on this PR. The PR has thorough inline documentation and tests covering the happy path. The missing guard in exponent_half_optimize is particularly subtle because the comment even claims the rewrite is "IEEE-equivalent" — which is true for all inputs except -0.0.

Comment on lines 185 to 196
  }
  }
  auto a = stmt->lhs;
- cast_to_result_type(a, stmt);
- auto result = Stmt::make<UnaryOpStmt>(UnaryOpType::sqrt, a);
+ cast_to_result_type(a, stmt, stmt->precise);
+ auto result = Stmt::make_typed<UnaryOpStmt>(UnaryOpType::sqrt, a);
  result->ret_type = a->ret_type;
+ // `a ** 0.5 -> sqrt(a)` is IEEE-equivalent, but the synthesized sqrt must carry `precise` so codegen clears FMF on
+ // it; otherwise `qd.precise(x ** 0.5)` silently gets `afn`-approximated.
+ result->precise = stmt->precise;
  stmt->replace_usages_with(result.get());
  modifier.insert_before(stmt, std::move(result));
  modifier.erase(stmt);

🔴 exponent_half_optimize() applies a**0.5 → sqrt(a) unconditionally even when stmt->precise is set, violating the qd.precise contract: pow(-0.0, 0.5) = +0.0 per IEEE 754, but sqrt(-0.0) = -0.0, so the substitution silently changes the sign bit. Add if (stmt->precise) return false; at the top of exponent_half_optimize() to skip the rewrite under precise mode.

Extended reasoning...

Bug summary

exponent_half_optimize() in alg_simp.cpp rewrites a**0.5 to sqrt(a) without consulting stmt->precise. The qd.precise contract promises that every algebraic identity rewrite is suppressed under precise mode so programs using qd.precise get bitwise-identical results to the unoptimized IEEE evaluation. This substitution breaks that promise for the specific input a = -0.0.

IEEE 754 signed-zero discrepancy

IEEE 754-2008 §9.2 (pow) defines pow(±0.0, y) = +0.0 for any y > 0 that is not an odd integer, so pow(-0.0, 0.5) = +0.0. IEEE 754-2008 §5.4.1 (squareRoot) defines squareRoot(-0.0) = -0.0 because sqrt propagates the sign bit. These are both correct IEEE results for their respective operations, but they disagree on the sign bit for -0.0. The rewrite therefore changes observable behavior for any program that relies on the sign of pow(-0.0, 0.5).

Why result->precise = stmt->precise does not fix this

The PR correctly propagates the precise flag to the synthesized UnaryOpStmt so codegen calls disable_fast_math() on the sqrt instruction, preventing the LLVM afn (approximate functions) fast-math flag from being applied. This addresses the codegen-level FMF concern but is orthogonal to the semantic issue: even with every fast-math flag cleared, the sqrt(-0.0) hardware instruction still returns -0.0 on all IEEE-compliant targets (x86, RISC-V, SPIR-V, PTX). Clearing FMF cannot change the fact that the substituted operation computes a semantically different function on -0.0.

Concrete proof of the violation

Consider the following sequence under qd.precise mode:

  1. User writes: qd.precise(a ** 0.5) where a = -0.0 (float32 bit pattern 0x80000000)
  2. exponent_half_optimize fires because get_exponent_values() returns {0.5} and the 0.5 check passes
  3. The BinaryOpStmt pow is replaced with a UnaryOpStmt sqrt carrying precise=true
  4. Codegen emits: sqrt(-0.0) with FMF cleared → result = -0.0 (0x80000000)
  5. Expected per IEEE 754 pow semantics: pow(-0.0, 0.5) = +0.0 (0x00000000)
    The user expected +0.0 but observes -0.0 — a sign-bit flip that strict IEEE arithmetic must not introduce.

False claim in source comment and test

alg_simp.cpp line 191 currently states "a ** 0.5 -> sqrt(a) is IEEE-equivalent". This is factually incorrect for a = -0.0. test_precise.py lines 479–483 contain a NOTE repeating the same claim and citing it as justification for omitting a behavioral test for this rewrite; that NOTE should be removed or corrected once the guard is added.

Fix

Add a guard at the start of exponent_half_optimize():

0 ""

0 ""

0 ""

1 "/usr/include/stdc-predef.h" 1 3 4

0 "" 2

1 ""

This is the minimal, conservative fix: precise programs skip the rewrite entirely and the codegen path for pow with precise=true already handles FMF clearing. The test NOTE in test_precise.py should also be updated to reflect that the a**0.5 rewrite is intentionally suppressed under precise mode.

Comment thread tests/python/test_precise.py
@hughperkins
Collaborator

Chatted with Opus for a while about reducing points of contact. It came up with:

Yes, exactly. Add a generic pre/post hook to the stmt dispatch loop. The codegen knows nothing about what the hooks do; the precise feature registers its own.
  The change to each codegen base class is ~10 lines:
  LLVM (`codegen_llvm.h` / `codegen_llvm.cpp`):

  // codegen_llvm.h -- generic hook type
  using CodegenStmtHook = std::function<void(TaskCodeGenLLVM &, Stmt *)>;
  std::vector<CodegenStmtHook> pre_stmt_hooks_;
  std::vector<CodegenStmtHook> post_stmt_hooks_;
  void add_pre_stmt_hook(CodegenStmtHook hook) { pre_stmt_hooks_.push_back(std::move(hook)); }
  void add_post_stmt_hook(CodegenStmtHook hook) { post_stmt_hooks_.push_back(std::move(hook)); }


  // codegen_llvm.cpp -- the only change to visit(Block*)
  void TaskCodeGenLLVM::visit(Block *stmt_list) {
    for (auto &stmt : stmt_list->statements) {
      for (auto &h : pre_stmt_hooks_) h(*this, stmt.get());
      stmt->accept(this);
      for (auto &h : post_stmt_hooks_) h(*this, stmt.get());
      if (returned) break;
    }
  }

  That's it for the codegen side. Completely generic. Same pattern for SPIR-V's TaskCodegen.
  Then in a new file (e.g. `precise_codegen_hooks.cpp`), the precise feature registers itself:

  void install_precise_hooks_llvm(TaskCodeGenLLVM &cg) {
    // Shared state captured by value: the hooks outlive this function, so a
    // stack local captured by reference would dangle.
    auto saved = std::make_shared<llvm::FastMathFlags>();
    cg.add_pre_stmt_hook([saved](TaskCodeGenLLVM &cg, Stmt *stmt) {
      if (stmt->codegen_hints & kDisableFastMath) {   // or however we track it
        *saved = cg.builder->getFastMathFlags();
        cg.builder->setFastMathFlags({});             // clear all FMF for this stmt
      }
    });
    cg.add_post_stmt_hook([saved](TaskCodeGenLLVM &cg, Stmt *stmt) {
      if (stmt->codegen_hints & kDisableFastMath) {
        cg.builder->setFastMathFlags(*saved);         // restore the ambient FMF
      }
    });
  }

  The codegen infrastructure doesn't know about precise, fast-math, or anything specific. It just fires callbacks. The precise feature brings its own logic.
  But there's still the question: how does the hook know which stmts are precise? That requires something on the stmt. We're back to the same problem -- unless the fence pass
  handles it.


  The fully decoupled design: fence pass + hooks

  With the fence pass, we don't need any marker on the stmt at all:
  1. Before optimization: fence pass replaces precise ops with InternalFuncStmt wrappers. Transforms never see them.
  2. After optimization, before codegen: unfence pass converts them back to BinaryOpStmt/UnaryOpStmt and adds their stmt IDs to a `std::unordered_set<int>` stored on the
     compilation unit (e.g. on OffloadedStmt or a context object the codegen already receives).
  3. The hook reads that set, not the stmt:


  void install_precise_hooks_llvm(TaskCodeGenLLVM &cg,
                                  const std::unordered_set<int> &precise_ids) {
    // Capture the id set by value: the hooks fire later, during codegen.
    cg.add_pre_stmt_hook([precise_ids](TaskCodeGenLLVM &cg, Stmt *stmt) {
      if (precise_ids.count(stmt->id)) {
        // save & clear FMF
      }
    });
    cg.add_post_stmt_hook([precise_ids](TaskCodeGenLLVM &cg, Stmt *stmt) {
      if (precise_ids.count(stmt->id)) {
        // restore FMF
      }
    });
  }

  Zero fields added to any existing IR class. The side table of stmt IDs is created by the fence/unfence pass and consumed by the hook. Both live in the precise feature's own
  files.
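
  To make the wiring concrete, a sketch of how the pieces might connect (irpass::unfence_precise and the exact call sites are placeholder names, not code in this PR):

  // after optimization, before codegen: the unfence pass produces the side table
  std::unordered_set<int> precise_ids;
  irpass::unfence_precise(ir, &precise_ids);  // restores Binary/UnaryOpStmt, records their stmt ids

  // at codegen setup, given the task's TaskCodeGenLLVM cg: the hooks consume
  // the table; nothing is stored on the stmts themselves
  install_precise_hooks_llvm(cg, precise_ids);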


  Final inventory


  ┌────────────────────────────┬───────────────────────────────────────────┬──────────────────────────────────────────────────────┐
  │ What                       │ Where                                     │ Touches existing code?                               │
  ├────────────────────────────┼───────────────────────────────────────────┼──────────────────────────────────────────────────────┤
  │ qd.precise() API           │ ops.py, export_lang.cpp                   │ 1 line binding                                       │
  │ Expression tagger          │ new ir/precise.cpp                        │ 0                                                    │
  │ bool precise on Expression │ frontend_ir.h                             │ 4 lines (consumed only by fence pass during flatten) │
  │ Fence/unfence pass         │ new transforms/precise_fence.cpp          │ 0                                                    │
  │ Pass calls                 │ compile_to_offloads.cpp                   │ 2 lines                                              │
  │ Hook infrastructure        │ codegen_llvm.h/.cpp, spirv_codegen.cpp    │ ~10 lines each (generic, reusable)                   │
  │ Precise hook registration  │ new precise_codegen_hooks.cpp             │ 0                                                    │
  │ Side table of IDs          │ produced by fence pass, consumed by hooks │ 0 (no field on any class)                            │
  └────────────────────────────┴───────────────────────────────────────────┴──────────────────────────────────────────────────────┘

  The existing IR classes, transforms, and codegen visitors are untouched. The hook infrastructure is generic and reusable for any future per-stmt codegen customization.
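
  As a throwaway illustration of that reusability (hypothetical, not part of this PR), a second consumer of the same entry points could count emitted stmts without touching any visitor:

  void install_stmt_counter(TaskCodeGenLLVM &cg, std::shared_ptr<int> count) {
    cg.add_post_stmt_hook([count](TaskCodeGenLLVM &, Stmt *) { ++*count; });
  }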

Thoughts?

@hughperkins
Collaborator

hughperkins commented Apr 16, 2026

created a branch to try to reduce contact points https://github.com/Genesis-Embodied-AI/quadrants/compare/experimental...hp/precise-pt_contact?expand=1

contact points still seem too broad to me. might prod opus a bit more to shrink these, at some point (e.g. cuda codegen has precise-specific stuff added)

(Note to self: implemented by deskai4; 8h30 ET apr 16 2026)
