[Lang] Add qd.precise(...) for per-op IEEE-strict FP. #476
duburcqa wants to merge 41 commits into experimental
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1eae2b4bcb
Force-pushed 1eae2b4 to 8f02070.
Could we add some user-facing doc please?
duburcqa
left a comment
Second pass — most of the review is inline. Overall the design (per-op IEEE-strict tag, backend-specific lowering) lands cleanly; my concerns are a narrow correctness gap in try_rearranging_const_rhs, a likely-overselling claim about SPIR-V NoContraction covering transcendentals, and a stale header comment. Plus a fragile negative-control assertion in the tests. Details inline.
PR description is AMAZING 🔥
duburcqa
left a comment
Second pass (from scratch). Five new findings, all lower priority than the first round. Most concern alg_simp.cpp rewrites that aren't gated on precise and therefore can bypass the IEEE-strict intent. Details inline.
Once you've added user-facing doc, please could you work your way through the AI prompts at https://www.notion.so/Meta-Quadrants-code-reviews-in-the-presence-of-AI-coding-33e0c06194e18080920dd87681976146?source=copy_link#3400c06194e180fe98d1e835a77d4246 , pasting the resulting conversation into this PR. Here are the questions (excluding a question about comment deletion, and about comments being 120-character aligned):
I've been doing this by hand, but it takes a while. So, if you don't want to do it by hand, two alternatives I'm ok-ish with (but haven't tried yet):
In all cases, totally fine to ask the AI to write its answer to a .md file, or similar, so it's convenient to copy and paste. (Though I've just been doing it by hand, so far, but I've been thinking doing it by hand might not be the most efficient approach...) Notes:
(I should probably add a question like:
Force-pushed 556c072 to 2669dc5.
duburcqa
left a comment
Third pass. Prior-round items are all addressed in commits 5438801 and 2669dc5 (inner-precise bail in try_rearranging_const_rhs, stale header comment, SPIR-V transcendental claim walked back, negative-control dropped, __all__ position, signed-zero gate, pow gate, sqrt in unary test, CUDA use_fast hoist). Only residual I see is three non-ASCII chars that slipped through the earlier scrubbing. Details inline.
Tests cover the primary doc-claimed contracts:
Documented-but-untested behaviors:
Strong on primary use case, thin on edge cases covered in the "Where the walker stops" and "Caveats" sections.
One clear candidate: the LLVM FMF-clear block in codegen_llvm.cpp is duplicated at the end of visit(UnaryOpStmt*) (~17 lines) and visit(BinaryOpStmt*) (~17 lines). It could become a ~10-line helper such as `static void clear_fast_math_if_precise(bool precise, llvm::Value *v)`, saving ~14 lines net. Other patterns: the alg_simp (fast_math && !stmt->precise) guard recurs ~6 times but factoring it would hurt readability; the SPIR-V DEFINE_BUILDER_BINARY_* macros are already factored.
Nothing significant. The one overstated claim (SPIR-V NoContraction covering transcendentals) was caught in review and walked back in both the code comment at spirv_codegen.cpp:888-895 and the PR description's coverage matrix. The code is honest about the best-effort nature of that coverage.
None spotted. test_precise.py follows the existing @test_utils.test + kernel + assertion style. precise.md matches static.md in structure (title, intro, sections). Field placement in BinaryOpStmt / UnaryOpStmt and inclusion in QD_STMT_DEF_FIELDS matches existing conventions.
None significant. bool precise{false} default-init matches existing IR field style. The maybe_no_contraction helper in the SPIR-V builder uses the same this->decorate(...) pattern as nearby code. The LLVM per-flag clear matches how setHas*(false) is used elsewhere.
Modestly yes: the LLVM FMF-clear refactor in Q2 saves ~14 lines. The walker is as compact as it can be. SPIR-V builder plumbing is bound by macro symmetry (adding optional precise to 5 existing binary ops). Not much further reducible without losing clarity.
qd.precise is a compile-time primitive with zero runtime cost on un-wrapped paths. Wrapped paths are slower by design (FMF-cleared LLVM ops, NoContraction, non-fast libdevice). Nothing to chase at runtime. Compile time: O(n) walker over expression AST, two extra emit(bool) calls in cache-key generation. Negligible.
One known best-effort limitation (documented): qd.precise(qd.sin/cos/log(x)) on SPIR-V backends. The SPIR-V spec scopes NoContraction to arithmetic instructions; most consumers ignore it on the OpExtInst calls used for transcendentals, so correctness on those paths is best-effort. Otherwise: alg_simp gates, the try_rearranging_const_rhs bail, the pow branch guard, the LLVM FMF clear, and the CUDA libdevice opt-out collectively cover the identified paths. Clone / IR-equality / CSE / cache-key all auto-carry precise via QD_STMT_DEF_FIELDS.
(Expanded from Q1.) Code added but not directly tested:
Biggest gaps: pow branch and the Python rejection are entirely untested.
The two tests are fundamentally different contracts, so merging them doesn't fit. Within test_qd_precise_unary_rounding, sin/cos/log/sqrt could be split with pytest.parametrize, but bundling them in one kernel is more compile-efficient. test_qd_precise_protects_fast_math runs two kernels specifically to exercise the cache-key distinction; it could be reformulated as a single @qd.kernel taking a compile-time flag, with the same compile outcome (two kernel variants under the hood) and marginally cleaner code. Not a big win; the current structure is defensible.
Docs: added docs/source/user_guide/precise.md (linked from the user guide TOC under Core concepts, next to static.md). Covers what qd.precise does, walker scope rules, interaction with fast_math, a per-backend coverage table (honest about the SPIR-V transcendental caveat), and a Dekker 2Sum worked example. Follow-ups arising from the review questions below (in c5fbab6):
@claude review
I was assisted by Claude Opus to write this PR. I have read and reviewed every change in this PR. I take full responsibility for the lines added and removed in this PR. I won't blame any issue on Claude Opus. I have reviewed this PR, and approve it.
Force-pushed c327965 to b229ca5.
From your earlier questions to AI, please could we add the missing tests that were identified: TernaryOp (qd.select) recursion
Responses from Opus for the AI questions above, using one agent per question. Note: I am NOT requesting these changes. I'm posting them so we have the information. I'm going to read these changes after posting, and request any changes from this post that I feel make sense to me.

Q1: To what extent do the tests match what is described in the doc?

What the doc says
| Doc behavior | Test | How |
|---|---|---|
| Core purpose: prevent algebraic folding under `fast_math=True` | `test_qd_precise_protects_fast_math` | Full Dekker 2Sum accumulator; asserts compensation term is non-zero and compensated sum beats naive f32 by >1000x |
| UnaryOp tagging (sin, cos, log, sqrt) | `test_qd_precise_unary_rounding` | Verifies results are within 2 ULP of a correctly-rounded f32 reference with `fast_math=True` |
| Walker descends through TernaryOp (`qd.select`) | `test_qd_precise_recurses_through_select` | Exploits signed-zero semantics: `(-0.0) + 0.0` must yield `+0.0` under IEEE, so the inner add must be tagged |
| Scalar-only caveat: Vector/Matrix raises | `test_qd_precise_rejects_quadrants_classes` | `pytest.raises(ValueError)` for both Vector and Matrix |
| Offline-cache key distinguishes precise from non-precise | `test_qd_precise_protects_fast_math` (indirect) | Runs the naive kernel first; if the cache key didn't account for `precise`, the precise kernel would silently reuse the naive artifact |
What is NOT tested
| Doc behavior | Gap |
|---|---|
| Composite unary+binary recursion: `qd.precise(qd.sqrt(a * a + b * b))` | The doc's "What gets protected" section explicitly shows the walker descending through a UnaryOp (sqrt) to tag inner BinaryOps (mul, add). No test exercises this path. `test_qd_precise_unary_rounding` wraps individual unary ops over a single load (`qd.sin(x[i])`), so the walker never encounters an inner BinaryOp beneath a UnaryOp. |
| Walker stops at loads | No test verifies the walker does not descend into ndarray indexing or field access. |
| Walker stops at constants | Not tested. |
| Walker stops at `qd.func` call sites | The doc explicitly warns that wrapping at the call site does not protect ops inside the callee. No test verifies this boundary. |
| Walker stops at atomic ops | Not tested. |
| `fast_math=False` + `qd.precise` | The doc's interaction table says precise is "redundant but harmless" when `fast_math=False`. No test runs with `fast_math=False` to confirm this. |
| FMA contraction prevention | The doc lists FMA contraction as a distinct optimization that `qd.precise` defeats. No test isolates this (the 2Sum test may indirectly exercise it, but doesn't assert on it). |
| Reassociation prevention | Listed as a distinct optimization. No test isolates `(a + b) + c` vs `a + (b + c)` reordering. |
| `a / a -> 1` algebraic fold | The doc lists this explicitly. Not tested (`a - a -> 0` is covered indirectly by 2Sum). |
| Aliasing caveat | The doc says "if you alias a subexpression and then wrap one alias, both uses get IEEE semantics." No test verifies this shared-tag behavior. |
| Other unary ops: neg, exp, rsqrt, casts, bit_cast | Only sin/cos/log/sqrt are tested. The doc's list is longer. |
| Other binary ops: %, comparisons, bit ops on FP types | Only +, -, * are exercised (via 2Sum). |
| Backend-specific behavior | Tests are backend-agnostic by design (run on whatever backend is available). No test targets the SPIR-V NoContraction decoration, CUDA libdevice fast/non-fast variant selection, or the best-effort transcendental caveat on Vulkan/Metal. This is arguably appropriate for a Python-level test suite, but the CUDA libdevice path change (the diff shows `__nv_fast_sinf` vs `__nv_sinf` selection) has no dedicated coverage. |
Tests that go BEYOND the doc
- Offline-cache key validation (`test_qd_precise_protects_fast_math`): The doc never mentions cache key generation. The test deliberately runs the naive kernel first to smoke-test that the cache key serializer (changed in `gen_offline_cache_key.cpp`) correctly differentiates `precise` from non-`precise` ops.
- Signed-zero semantics (`test_qd_precise_recurses_through_select`): The doc does not discuss signed zeros. The test uses `-0.0 + 0.0` yielding `+0.0` as the observable signal that `alg_simp`'s `a + 0 -> a` fold was blocked, a clever technique not mentioned in the doc.
- `pow` precise-bail note: The test file includes a trailing comment explaining why `pow` under `qd.precise` is deliberately untested (the rewrites are IEEE-equivalent, so no observable difference). The doc does not mention `pow` at all, but the implementation (visible in the diff's `alg_simp.cpp` changes) does gate pow rewrites on `!stmt->precise`.
Summary
The tests cover the primary use case (compensated arithmetic surviving fast_math=True) thoroughly and with a well-designed end-to-end workload. They also cover the unary-op path, the TernaryOp recursion, and the scalar-only error. The main gaps are around walker boundary conditions (stops at loads, constants, qd.func calls, atomics), the composite unary-wrapping-binary recursion the doc highlights, the aliasing caveat, the fast_math=False interaction, and individual algebraic-simplification rules beyond a - a -> 0. These are mostly "negative" or "boundary" tests rather than core-functionality tests, but several correspond to explicitly documented behaviors that a reader would expect to see exercised.
Is there any code that is duplicate or could be refactored?
Overall the PR is well-structured. Most patterns exist for good reason (the precise flag genuinely needs to be threaded through four separate layers: Expression → Stmt → transform → codegen). That said, there are a few concrete refactoring opportunities:
1. Repeated (fast_math && !stmt->precise) || is_integral(...) guard — alg_simp.cpp
The expression (fast_math && !stmt->precise) || is_integral(stmt->ret_type.get_element_type()) (or a close variant using stmt->lhs->ret_type) appears five times in quadrants/transforms/alg_simp.cpp:
- Line 348 (`0 * a → 0`)
- Line 402 (`a / a → 1`)
- Line 408 (standalone `fast_math && !stmt->precise && ...` for the reciprocal rewrite)
- Line 464 (`a - a → 0`)
- Line 513 (`a == a → 1`)
Before this PR, the pattern was already fast_math || is_integral(...) repeated in the same spots; the PR mechanically inserted && !stmt->precise into each one. This is a good candidate for a small inline helper:
```cpp
// True when the compiler may apply fast-math-style algebraic folds to `stmt`.
bool may_fold(BinaryOpStmt *stmt) const {
  return (fast_math && !stmt->precise) || is_integral(stmt->ret_type.get_element_type());
}
```

This would remove five (soon more, as new simplifications are added) repetitions of the same three-part condition and make future additions less error-prone.
2. Identical if (stmt->precise) { disable_fast_math(...); } blocks — codegen_llvm.cpp
quadrants/codegen/llvm/codegen_llvm.cpp has the same two-line block in two places (lines 496–498 for UnaryOpStmt and lines 776–778 for BinaryOpStmt):
```cpp
if (stmt->precise) {
  disable_fast_math(llvm_val[stmt]);
}
```

This is minor (two occurrences, three lines each), but if TernaryOpStmt ever gains a precise flag, or if the post-processing grows (e.g. adding metadata), a small template helper or lambda would keep things DRY:
```cpp
auto maybe_clear_fmf = [&](auto *s) {
  if (s->precise) disable_fast_math(llvm_val[s]);
};
```

Not critical today but worth noting.
3. Near-identical bool precise{false} field + comment across four IR classes
The bool precise{false}; field is added to four classes, each with a near-identical two-line doc comment ("Mirrors MSL/HLSL precise"):
- `UnaryOpExpression` (frontend_ir.h:375–377)
- `BinaryOpExpression` (frontend_ir.h:401–403)
- `UnaryOpStmt` (statements.h:158–160)
- `BinaryOpStmt` (statements.h:254–256)
The comments are subtly inconsistent (the UnaryOpStmt version says "no contraction, no approximate implementations" while BinaryOpStmt says "no reassociation, no contraction, no algebraic folds"). This isn't a code-level refactoring opportunity per se (C++ doesn't have a great mechanism for sharing a field across unrelated class hierarchies), but the comment wording should be unified. A single canonical comment could be defined next to the precise() function in expr.h and the field-level comments shortened to a cross-reference.
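A minimal sketch of that suggestion (wording and placement are illustrative, not taken from the diff):

```cpp
// expr.h, next to precise() (sketch of a single canonical contract comment):
// `precise` mirrors MSL/HLSL `precise`: no reassociation, no contraction,
// no algebraic folds, no approximate implementations.

// frontend_ir.h / statements.h (sketch): field comments become cross-references.
bool precise{false};  // See the `precise()` contract in expr.h.
```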
4. precise propagation in flatten() — parallel boilerplate
The propagation from Expression → Stmt happens via identical one-liners in two flatten() methods:
- `UnaryOpExpression::flatten()` at frontend_ir.cpp:264: `unary->precise = precise;`
- `BinaryOpExpression::flatten()` at frontend_ir.cpp:434: `bin_stmt->precise = precise;`
This is inherent to the current flatten-per-class design and is not independently refactorable without reworking the flatten infrastructure. Noting for completeness only — no action needed.
5. Test helpers: two_sum_naive / two_sum_precise and fast_two_sum_naive / fast_two_sum_precise
In tests/python/test_precise.py (lines 11–59), the test defines four @qd.func helpers that come in naive/precise pairs:
- `two_sum_naive` vs `two_sum_precise`: identical structure, differ only in `qd.precise(...)` wrappers
- `fast_two_sum_naive` vs `fast_two_sum_precise`: same
The duplication is intentional (the test needs both variants to compare behavior), so this is fine as-is. A parameterized factory could reduce lines but would hurt test readability. No action recommended.
6. SPIR-V maybe_no_contraction indirection
spirv_ir_builder.h declares maybe_no_contraction(Value v, bool precise) and spirv_ir_builder.cpp:678 defines it as:
```cpp
void IRBuilder::maybe_no_contraction(Value v, bool precise) {
  if (precise) {
    this->decorate(spv::OpDecorate, v, spv::DecorationNoContraction);
  }
}
```

This is then called in:
- the `DEFINE_BUILDER_BINARY_USIGN_OP` macro (line 692)
- the `DEFINE_BUILDER_BINARY_SIGN_OP` macro (line 706)
- `IRBuilder::mod()` (line 725)
- spirv_codegen.cpp:896 (unary op visitor)
The helper is well-factored. No issue.
Summary
| Priority | Item | Location | Recommendation |
|---|---|---|---|
| Medium | Repeated `(fast_math && !stmt->precise) \|\| is_integral(...)` guard | alg_simp.cpp ×5 | Extract a `may_fold()` helper |
| Low | Identical `if (stmt->precise) disable_fast_math(...)` blocks | codegen_llvm.cpp ×2 | Extract a lambda or template helper |
| Low | Inconsistent `precise` field comments | frontend_ir.h, statements.h ×4 | Unify wording |
| None | Test helper duplication | test_precise.py | Intentional; no action |
| None | `flatten()` one-liners | frontend_ir.cpp ×2 | Structural; no action |
Is there any code that might be embarrassing if published publicly?
Overall verdict: No. This is a clean, professional PR. There are no secrets, offensive content, debug leftovers, or placeholder values. A few minor cosmetic items are worth noting:
1. Missing newline at end of precise.md
The new documentation file docs/source/user_guide/precise.md is missing a trailing newline (the diff shows \ No newline at end of file on line 130). This is a trivial lint issue but will show up in many editors and CI linters.
2. PR perpetuates the ARTHIMATIC typo
In quadrants/codegen/spirv/spirv_codegen.cpp, the macro BINARY_OP_TO_SPIRV_ARTHIMATIC is a pre-existing misspelling of "ARITHMETIC". This PR modifies the macro body (adding the bin->precise argument) but keeps the misspelled name. While not introduced by this PR, touching the macro without fixing the name is a missed opportunity and the typo will continue to be visible in the codebase.
3. precise inserted out of order in __all__
In python/quadrants/lang/ops.py, "precise" is inserted between "cast" and "ceil" (line 1564). The surrounding entries in that section of __all__ are roughly alphabetical (bit_cast, bit_shr, cast, ceil, cos, exp, ...), so precise should appear later, after pow. The test_api.py placement is correct (between pow and profiler), making the inconsistency more visible.
4. Empty if body as a control-flow guard
In quadrants/transforms/alg_simp.cpp, the pow precise-bail is written as:
```cpp
if (stmt->precise) {
  // Preserve the user's `pow()` call verbatim. The helpers below rewrite into sqrt/mul/div chains
  // whose synthesized stmts inherit `precise=false`, stripping the IEEE-strict tag.
} else if (exponent_one_optimize(stmt)) {
```

An empty if body with only a comment is a recognized pattern, and the comment explains the intent well, but some style guides flag it. A `return;` or early-continue would be more explicit.
Items explicitly NOT found
- No hardcoded secrets, API keys, or passwords.
- No offensive or inappropriate language in comments or variable names.
- No TODO/HACK/FIXME comments introduced by this PR (the ones visible in the diff context, e.g. `// FIXME: figure out why OpSRem does not work` and `// TODO: remove this field`, are pre-existing).
- No placeholder/dummy values that shouldn't ship.
- No debug `printf`/`print`/`std::cerr` statements left in.
- No copyright issues.
- No misleading comments — the comments are accurate, thorough, and match the code behavior.
- Documentation is high quality with correct examples and a detailed backend-coverage table.
Are there any layout inconsistencies?
Overall the PR is well-formatted: C++ uses 2-space indentation with attached braces, Python uses 4-space indentation, and comment style is consistent with existing conventions. The .clang-format 120-column limit is respected throughout. That said, there are a few specific inconsistencies:
1. __all__ ordering in python/quadrants/lang/ops.py (line 1564)
"precise" is inserted between "cast" and "ceil", breaking the alphabetical ordering that the surrounding entries follow:
"bit_cast",
"bit_shr",
"cast",
"precise",
"ceil",
"cos",Alphabetically, "precise" belongs after "pow" (near line 1584). By contrast, tests/python/test_api.py correctly places it between "pow" and "profiler":
"pow",
"precise",
"profiler",The existing __all__ list isn't perfectly sorted (the tail — "max", "min", "select", "abs", "pow" — is out of order), but inserting "precise" right after "cast" in the middle of an alphabetically-sorted run is visually jarring and clearly unintentional.
2. Missing trailing newline in docs/source/user_guide/precise.md
The diff shows \ No newline at end of file at the end of the new documentation file. Every other .md file in docs/source/user_guide/ ends with a newline. This violates POSIX text-file conventions and will produce a git diff warning.
3. Inconsistent QD_NOT_IMPLEMENTED reformatting in quadrants/codegen/spirv/spirv_codegen.cpp
The PR reformats the one-liner in the unary visit() to a proper multi-line block:
```cpp
else {
  QD_NOT_IMPLEMENTED
}
```

But the structurally identical pattern in the binary visit() is left as a single line:
```cpp
else {QD_NOT_IMPLEMENTED} ir_->register_value(bin_name, bin_value);
```

This creates an internal inconsistency in the same file. If the unary path was worth reformatting (which it was: AllowShortBlocksOnASingleLine is false in .clang-format), the binary path should get the same treatment.
4. No other layout issues found
The following were checked and are consistent:
- C++ indentation: 2-space indent throughout, matching `.clang-format` `IndentWidth: 2`. No tabs.
- Brace placement: Attached braces (`BreakBeforeBraces: Attach`) used consistently in all new code.
- C++ comment style: `//` line comments with 2 spaces before trailing comments, matching `SpacesBeforeTrailingComments: 2`.
- C++ line lengths: All new/changed lines are within the 120-column limit.
- Python indentation: 4-space indent throughout, PEP 8 compliant.
- Python docstring format: The `precise()` docstring in ops.py uses the same Google-style format with `Args:`, `Returns:`, `Notes:`, and `Example::` sections as the neighboring `cast()` and `bit_cast()` functions.
- Python import ordering: test_precise.py follows the same `import numpy` / `import pytest` / `import quadrants` / `from tests import test_utils` ordering as other test files (e.g., test_cast.py).
- Include ordering: No new includes are added.
- Macro formatting: The `DEFINE_BUILDER_BINARY_USIGN_OP` / `DEFINE_BUILDER_BINARY_SIGN_OP` macros in spirv_ir_builder.cpp use aligned backslash continuation, consistent with the previous version and with neighboring macros like `DEFINE_BUILDER_CMP_OP`.
- Blank line conventions: New function definitions are separated by blank lines; `if (stmt->precise)` blocks are preceded by a blank line for visual separation, matching the existing pattern of adding breathing room before guard clauses.
- Header field ordering: The `bool precise{false}` fields in frontend_ir.h and statements.h are placed after the existing fields and before constructors, consistent with the existing `cast_type` field placement in `UnaryOpExpression`.
Are there any coding inconsistencies?
Overall the PR is well-written and follows existing patterns closely. A few inconsistencies remain, ranging from minor style nits to potential correctness concerns:
1. __all__ ordering: "precise" breaks alphabetical sort
The __all__ list in ops.py is alphabetically sorted (bit_cast, bit_shr, cast, ceil, cos, …). The PR inserts "precise" immediately after "cast" (line 1564), violating the sort order — it should appear after "pow" and before "profiler" (or wherever p-entries belong). This mirrors the alphabetical intent of the existing list but was not respected.
2. Stmt::make<> vs Stmt::make_typed<> inconsistency within the same file
In alg_simp.cpp, the PR changes one Stmt::make<BinaryOpStmt> to Stmt::make_typed<BinaryOpStmt> for the 2 * a → a + a rewrite (line 377) so that it can set sum->precise. However, every other synthesized BinaryOpStmt in the same file still uses Stmt::make<BinaryOpStmt> (lines 224, 233, 270, 272, 362, 416, 429). The mix of make<> and make_typed<> for the same type within a single file is a style inconsistency. Either is fine, but the PR should pick the one already dominant in the file (make<>), or convert them all to make_typed<>.
3. Synthesized stmts in alg_simp.cpp don't propagate precise — inconsistent treatment
The 2 * a → a + a rewrite correctly copies stmt->precise to the new sum stmt. But several other rewrites in the same file create new BinaryOpStmt / UnaryOpStmt nodes that inherit precise=false by default:
- `a ** 0.5 → sqrt(a)` (line 186): creates `UnaryOpStmt(sqrt)` without copying `precise`.
- `a ** n` → exponentiation by squaring (lines 224, 233): creates `mul` stmts without `precise`.
- `a ** -n → 1 / a**n` (lines 268–272): creates `neg`, `pow`, `div` stmts without `precise`.
- `a / const → a * (1/const)` (line 416): creates `mul` without `precise`.
The PR handles this for pow by bailing out entirely when stmt->precise is set (the if (stmt->precise) { ... } guard at line 470), which is correct for pow. But the a / const → a * (1/const) rewrite is guarded only by fast_math && !stmt->precise at a higher level (line 408), and the sqrt and exponentiation-by-squaring rewrites are not behind a precise guard at all. If a user wrote qd.precise(x ** 0.5), the exponent_half_optimize path would fire and produce an untagged sqrt, silently dropping the precise intent. This is a functional gap, not just a style issue.
4. UnaryOpStmt::same_operation() does not account for precise
same_operation() (in statements.cpp, line 25) only compares op_type and cast_type. Two UnaryOpStmts that differ only in precise will be reported as the "same operation." This could cause CSE or other passes that use same_operation() to merge a precise op with a non-precise one. The QD_STMT_DEF_FIELDS macro does include precise, so the field_manager.equal() path in same_statements.cpp is correct, but any call site relying on same_operation() alone could be a problem.
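A hedged sketch of the kind of guard being suggested; the real signature of same_operation() is not shown in the diff, so treat this as illustrative only:

```cpp
// Sketch: ops that differ only in `precise` must not compare as the "same
// operation", so passes relying on same_operation() alone cannot merge a
// precise op with a non-precise one. Signature assumed, not taken from
// statements.cpp.
bool UnaryOpStmt::same_operation(UnaryOpStmt *other) const {
  return op_type == other->op_type && cast_type == other->cast_type &&
         precise == other->precise;
}
```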
5. TernaryOpExpression and TernaryOpStmt lack a precise field
The walker in expr.cpp descends through TernaryOpExpression and tags its children, but never sets a precise flag on the TernaryOpExpression itself. TernaryOpStmt likewise has no precise field. This is internally consistent (select/ifte are not FP arithmetic), but it means the offline cache key generator (gen_offline_cache_key.cpp) does not emit precise for ternary nodes — which is fine today since they don't carry the flag, but creates an asymmetry: if a ternary's children are tagged precise, the ternary's cache key stays the same, and correctness relies entirely on the children's cache keys distinguishing themselves. This is correct but worth documenting as intentional.
6. The BINARY_OP_TO_SPIRV_ARTHIMATIC macro name preserves a misspelling
The macro is spelled ARTHIMATIC (should be ARITHMETIC). This is a pre-existing issue, but the PR widens the macro's signature (adding the bin->precise argument) without fixing the typo, which could have been a low-cost cleanup. Minor, but an opportunity missed.
7. Python-side: no @quadrants_scope decorator on precise()
bit_cast() and cast() in ops.py do not use @quadrants_scope either, so precise() is consistent with those immediate neighbors. However, many other ops in the file do use the decorator. Since precise() only makes sense inside a kernel/func body (it calls _qd_core.precise which requires an active IR context), the lack of the decorator means the error message for out-of-scope usage will be a cryptic C++ exception rather than a clean Python-level error. This matches cast/bit_cast behavior, so it is not a new inconsistency, but it is a missed opportunity to be consistent with the majority of the file.
8. Error type: ValueError vs TypeError
precise() raises ValueError("Cannot apply precise on Quadrants classes"). The existing bit_cast() does the same (ValueError). However, passing an object of the wrong type is arguably a TypeError, not a ValueError. This is consistent with bit_cast but inconsistent with Python conventions. Minor.
9. Docstring in ops.py says "every binary FP op" but code also tags unary ops
The docstring for precise() says "Every binary FP op inside obj is evaluated in source order…" and the Returns section says "with every reachable binary op tagged as precise". But the C++ implementation (and the longer body of the docstring) also tags UnaryOpExpression nodes. The summary line and Returns section are inaccurate — they should say "binary and unary FP ops."
Could anything be implemented in less code?
1. Extract (fast_math && !stmt->precise) into a helper — 5 call sites in alg_simp.cpp
The condition (fast_math && !stmt->precise) is repeated verbatim 5 times in quadrants/transforms/alg_simp.cpp (lines 348, 402, 408, 464, 513). Four of those also share the larger compound pattern ((fast_math && !stmt->precise) || is_integral(...)). A one-line helper on AlgSimp would collapse all of them:
```cpp
bool may_fold_fp(const BinaryOpStmt *stmt) const {
  return fast_math && !stmt->precise;
}
```

Then the four compound sites become `(may_fold_fp(stmt) || is_integral(stmt->ret_type.get_element_type()))`, and the standalone site (line 408) becomes just `may_fold_fp(stmt)`. This eliminates a repeated two-operand boolean expression that reviewers have to mentally re-parse each time, and ensures a future change (e.g. adding a third condition) only needs to touch one place.
2. Fold the if (stmt->precise) guard into disable_fast_math — duplicated in codegen_llvm.cpp
The identical three-line block appears twice, once at the end of visit(UnaryOpStmt *) (line 496) and once at the end of visit(BinaryOpStmt *) (line 776):
```cpp
if (stmt->precise) {
  disable_fast_math(llvm_val[stmt]);
}
```

The SPIR-V backend already does this the right way: maybe_no_contraction(v, precise) absorbs the bool check internally. Apply the same pattern to the LLVM side: make disable_fast_math accept the precise bool and early-return when false:
```cpp
void maybe_disable_fast_math(llvm::Value *v, bool precise) {
  if (!precise) return;
  auto *inst = llvm::dyn_cast<llvm::Instruction>(v);
  // ... clear flags ...
}
```

Each call site collapses to a single line: `maybe_disable_fast_math(llvm_val[stmt], stmt->precise);`. This also makes the LLVM and SPIR-V backends consistent in style.
3. Redundant double-check of precise in SPIR-V unary codegen
In quadrants/codegen/spirv/spirv_codegen.cpp (around line 893):
```cpp
if (stmt->precise && is_real(stmt->element_type())) {
  ir_->maybe_no_contraction(val, /*precise=*/true);
}
```

The call passes a literal true to maybe_no_contraction, which then internally checks if (precise). The inner check is guaranteed to be true. Either:
- Pass `stmt->precise` instead of `true` and drop the outer `stmt->precise &&` guard (let the function handle it), or
- Call `ir_->decorate(...)` directly since we already know `precise` is true.
Either option removes a layer of redundancy.
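For concreteness, a sketch of option (a), reusing the maybe_no_contraction helper and is_real check quoted above:

```cpp
// Sketch of option (a): forward the flag and let the helper do the check.
if (is_real(stmt->element_type())) {
  ir_->maybe_no_contraction(val, stmt->precise);
}
```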
4. Empty if body for pow precise bail — alg_simp.cpp
The pow-handling block (around line 470) uses an empty if body as a no-op guard:
```cpp
if (stmt->precise) {
  // Preserve the user's `pow()` call verbatim.
} else if (exponent_one_optimize(stmt)) {
```

An empty if body that exists only to skip the else chain is unconventional and easy to misread. A simple negated guard is shorter and clearer:
```cpp
if (!stmt->precise) {
  if (exponent_one_optimize(stmt)) {
    // a ** 1 -> a
  } else if (exponent_zero_optimize(stmt)) {
    // ...
  }
}
```

5. BinaryOpExpression::flatten (minor): use make_typed for inline field set
In quadrants/ir/frontend_ir.cpp (around line 432), the PR expands a one-liner into three statements to set precise:
```cpp
auto bin_stmt = std::make_unique<BinaryOpStmt>(type, lhs_stmt, rhs_stmt, /*is_bit_vectorized=*/false, dbg_info);
bin_stmt->precise = precise;
ctx->push_back(std::move(bin_stmt));
```

This is fine, but if the BinaryOpStmt constructor accepted an optional precise parameter (as was done for the SPIR-V builder's add/sub/mul/div/mod functions), the flatten code would stay a one-liner. The same applies to UnaryOpStmt in the preceding UnaryOpExpression::flatten. This is a minor point: the current approach is defensible if keeping the constructor signature stable is preferred.
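A sketch of what the call site could look like if such a parameter existed; the trailing `precise` argument is hypothetical, the current constructor does not take it:

```cpp
// Hypothetical constructor overload keeps flatten() a one-liner.
ctx->push_back(std::make_unique<BinaryOpStmt>(type, lhs_stmt, rhs_stmt,
                                              /*is_bit_vectorized=*/false,
                                              dbg_info, /*precise=*/precise));
```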
Summary
The biggest win is item #1: extracting the repeated fast_math && !stmt->precise guard into a helper. It touches 5 sites in alg_simp.cpp alone and would eliminate the most pervasive pattern of duplicated logic introduced by this PR. Items #2 and #3 are smaller but improve consistency between the LLVM and SPIR-V backends.
Could anything run faster at runtime?
TL;DR
The non-precise hot path is unaffected — generated code is identical to before. But one compile-time decision in alg_simp.cpp leaves significant runtime performance on the table for qd.precise(x ** n), and a few compile-time micro-inefficiencies could be tightened.
1. pow precise bail-out leaves runtime performance on the table (alg_simp.cpp:469–472)
This is the biggest issue. The current code unconditionally bails out of all exponent rewrites when precise is set:
```cpp
if (stmt->precise) {
  // Preserve the user's `pow()` call verbatim.
} else if (exponent_one_optimize(stmt)) {
```

The test file's own comment acknowledges the rewrites (a**1 → a, a**0.5 → sqrt(a), a**n → mul chain) are IEEE-equivalent to the original pow(). The stated reason for bailing is that the synthesized BinaryOpStmt/UnaryOpStmt nodes inherit precise=false, stripping the tag. But that's easily fixable: propagate stmt->precise onto each synthesized node (the same way the 2*a → a+a rewrite already does at line 382). Then the user gets the fast mul-chain and keeps IEEE-strict semantics on every op.
As written, qd.precise(x ** 4) emits a pow(x, 4) libcall/intrinsic at runtime instead of three multiplies — that is an order-of-magnitude slower on every backend (GPU pow is ~50 cycles vs. ~12 cycles for three FMULs). Users of compensated arithmetic do use small-integer powers, so this is likely to bite.
Recommendation: Propagate precise onto synthesized stmts in exponent_n_optimize, exponent_half_optimize, exponent_negative_optimize (as is already done for the alg_is_two → add rewrite), and remove the blanket bail-out.
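A minimal sketch of that propagation, modeled on the 2 * a → a + a rewrite quoted earlier; the make_typed argument order and the BinaryOpType::mul spelling are assumptions about code not shown in this thread:

```cpp
// Sketch, inside e.g. exponent_n_optimize: every synthesized multiply keeps
// the IEEE-strict tag instead of silently defaulting to precise=false.
auto mul = Stmt::make_typed<BinaryOpStmt>(BinaryOpType::mul, stmt->lhs, stmt->lhs);
mul->precise = stmt->precise;  // same pattern as the alg_is_two -> add rewrite
```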
2. Non-precise runtime path: no regression
For every codegen backend, the precise check only fires when stmt->precise == true:
- LLVM: `disable_fast_math()` is behind `if (stmt->precise)`; non-precise ops never call it, so the emitted LLVM IR has the same fast-math flags as before.
- CUDA: `const bool use_fast = compile_config.fast_math && !stmt->precise`; when `precise=false`, this evaluates to the same value as the old `compile_config.fast_math`. The same libdevice function is selected.
- SPIR-V: NoContraction decorations are only appended when `precise=true`. The binary for non-precise kernels is byte-identical.
The alg_simp.cpp and binary_op_simplify.cpp changes add && !stmt->precise to conditions that already test fast_math. Since stmt->precise defaults to false and the branch predictor will trivially learn this, the added boolean check costs ≈0 cycles per statement visit during compilation. No runtime cost at all.
3. disable_fast_math clears 7 flags individually (codegen_llvm.cpp:31–41)
The function calls 7 separate setHas* methods. The comment says setFastMathFlags(FastMathFlags{}) only OR's flags in, but LLVM provides Instruction::copyFastMathFlags(FastMathFlags) (which replaces rather than ORs) and FPMathOperator::setFastMathFlags with a replace overload. A single call to that with a default-constructed FastMathFlags (all clear) would be cleaner and marginally faster. This is compile-time-only, but it's called once per precise op during codegen, and 7 virtual-dispatch setter calls are noisier than one.
Recommendation: Verify which LLVM version the project targets and switch to copyFastMathFlags(FastMathFlags{}) if available.
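A sketch of the single-call variant, assuming the targeted LLVM release provides Instruction::copyFastMathFlags(FastMathFlags) with replace semantics as described above:

```cpp
#include "llvm/IR/Instruction.h"
#include "llvm/IR/Operator.h"

// Sketch: clear every fast-math flag in one call instead of seven setters.
static void disable_fast_math(llvm::Value *v) {
  auto *inst = llvm::dyn_cast<llvm::Instruction>(v);
  if (inst && llvm::isa<llvm::FPMathOperator>(inst)) {
    inst->copyFastMathFlags(llvm::FastMathFlags{});  // default FMF = all flags off
  }
}
```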
4. precise() expression walker uses shared_ptr copies and RTTI (expr.cpp:55–79)
The worklist is std::vector<Expr>, and Expr holds a shared_ptr<Expression>. Every stack.push_back(bin->lhs) performs an atomic ref-count increment; every stack.pop_back() decrements. The cur.cast<BinaryOpExpression>() call goes through std::dynamic_pointer_cast, which does a RTTI dynamic_cast under the hood. For a node that matches none of the three types (Binary/Unary/Ternary), all three dynamic_pointer_cast calls fire and fail.
This runs at kernel-definition time (Python → AST), not at GPU runtime, and expression trees are typically small (a handful of nodes per qd.precise() call). So the absolute cost is negligible. But if this ever needs to scale (e.g., wrapping large generated expression trees):
- Use raw `Expression *` in the worklist instead of `Expr` to avoid atomic ref-count churn.
- Use a type-tag / virtual dispatch pattern (e.g., an `expr->node_kind()` enum) instead of trial-and-error `dynamic_pointer_cast`.
Verdict: Not a practical concern today, but worth noting for future-proofing.
5. SPIR-V maybe_no_contraction is a non-inlined call on every FP binary op
maybe_no_contraction(v, precise) is defined in spirv_ir_builder.cpp and called from macros in the same TU, so the compiler may inline it. But if it doesn't (separate compilation, no LTO), every add/sub/mul/div on FP types pays for a function call + branch even when precise=false. Making it inline in the header (it's a trivial one-liner) would guarantee the branch is folded away at the call site.
Recommendation: Move the body to the header as an inline method, or mark it __attribute__((always_inline)).
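A sketch of the header-inline version, reusing the members from the out-of-line definition quoted earlier:

```cpp
// spirv_ir_builder.h (sketch): trivial one-liner, inlined so the precise=false
// case can fold away at every call site even without LTO.
inline void IRBuilder::maybe_no_contraction(Value v, bool precise) {
  if (precise) {
    this->decorate(spv::OpDecorate, v, spv::DecorationNoContraction);
  }
}
```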
6. QD_STMT_DEF_FIELDS now includes precise — CSE/identity cost
Adding precise to the serialized fields of UnaryOpStmt and BinaryOpStmt means:
- `StmtFieldManager::equal()` compares one more field per op during CSE passes.
- The offline cache key is one byte longer per op.
This is correct and necessary (a precise a+b must not be deduplicated with a non-precise a+b), and the cost is trivially small. No concern.
7. SPIR-V decoration overhead at runtime
NoContraction is a SPIR-V metadata decoration. It does not add instructions to the shader binary — it's a directive to the downstream compiler (SPIRV-Cross, driver frontend) to avoid contraction. There is no per-invocation runtime cost for the decoration itself. The runtime cost is the absence of optimizations (no FMA contraction, no reassociation), which is the intended semantic.
Summary of actionable items
| Priority | Item | Impact |
|---|---|---|
| High | Propagate `precise` onto synthesized stmts in pow exponent rewrites instead of bailing out | Avoids the `pow()` libcall for `qd.precise(x**n)`; up to 5–10× faster for small integer exponents |
| Low | Replace 7 individual FMF clears with `copyFastMathFlags({})` | Cleaner code, marginal compile-time improvement |
| Low | Inline `maybe_no_contraction` in the SPIR-V builder header | Avoids one function call per FP op during SPIR-V codegen |
| Negligible | Use raw pointers in `precise()` walker | Only matters for hypothetical very large expression trees |
Could anything exhibit incorrect behavior?
Yes. There are several issues ranging from a clear codegen bug to latent correctness risks. Listed from most to least severe:
1. AMDGPU backend: precise flag is silently ignored for BinaryOpStmt (codegen bug)
The AMDGPU codegen (codegen_amdgpu.cpp:386-421) overrides visit(BinaryOpStmt *) for pow and atan2. For all other ops it delegates to TaskCodeGenLLVM::visit(stmt), which does apply disable_fast_math when stmt->precise is set. But for pow and atan2, the AMDGPU override returns without calling disable_fast_math:
```cpp
void visit(BinaryOpStmt *stmt) override {
  auto op = stmt->op_type;
  // ...
  if (op == BinaryOpType::pow) {
    // ... calls __ocml_pow_* ...
  } else if (op == BinaryOpType::atan2) {
    // ... calls __ocml_atan2_* ...
  }
  // No `if (stmt->precise) disable_fast_math(...)` here!
}
```

The CUDA backend has the same structural pattern for pow/atan2 (lines 660-710), and it too omits disable_fast_math, but this matters less for pow/atan2 since they're plain function calls, not FP intrinsics with FMF. Still, the omission means the FPExt/FPTrunc instructions around them (e.g. line 675, 708 in CUDA; and the CreateSIToFP/CreateFPToSI pair in AMDGPU line 403-406) do carry fast-math flags and won't have them cleared for precise ops. In practice, FMF on cast instructions is usually benign, but the inconsistency is a latent risk.
The doc table (line 104) claims AMDGPU has precise support ("LLVM FMF cleared") but no code changes were made to codegen_amdgpu.cpp. The base-class TaskCodeGenLLVM handles normal arithmetic, but emit_extra_unary is completely overridden in AMDGPU and never checks stmt->precise — the __ocml_* functions are always called unconditionally. Unlike CUDA (which has __nv_fast_* variants), AMDGPU's __ocml_* functions are already correctly-rounded, so this is not a behavioral bug today. But if the compiler ever adds __ocml_fast_* variants, this code path has no precise guard to prevent selecting them.
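A hedged sketch of the missing guard, mirroring the `if (stmt->precise) disable_fast_math(llvm_val[stmt]);` pattern the base class already uses; whether it is actually needed on these libcall paths is the open question above:

```cpp
// Sketch: end of the AMDGPU visit(BinaryOpStmt *) override for pow/atan2,
// mirroring TaskCodeGenLLVM so precise ops also get their fast-math flags
// cleared on the resulting value (surrounding casts would need their own
// handling).
if (stmt->precise) {
  disable_fast_math(llvm_val[stmt]);
}
```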
2. SPIRV inv unary op: ir_->div() called without precise (minor, masked)
In spirv_codegen.cpp:841, the inv unary op calls ir_->div() without forwarding stmt->precise:
```cpp
} else if (stmt->op_type == UnaryOpType::inv) {
  if (is_real(dst_dt)) {
    val = ir_->div(ir_->float_immediate_number(dst_type, 1), operand_val);
  } else {
    QD_NOT_IMPLEMENTED
  }
```

This is masked because the post-hoc maybe_no_contraction(val, true) at line 895-896 decorates the same SPIR-V value ID afterward. So the OpFDiv does get NoContraction, but only by accident. If anyone refactors this code to return early before line 895, the decoration would be silently lost.
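A sketch of forwarding the flag at the call site instead of relying on the post-hoc decoration; it assumes div() accepts the optional precise parameter this PR adds to the other binary builders, which may not match the exact signature:

```cpp
// Sketch: tag the OpFDiv at creation time rather than via the later
// maybe_no_contraction(val, true) at the end of the visitor.
val = ir_->div(ir_->float_immediate_number(dst_type, 1), operand_val,
               /*precise=*/stmt->precise);
```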
3. Commutative swap in binary_op_simplify.cpp not gated on precise
In binary_op_simplify.cpp:83-88, the commutative operand swap (const_lhs to RHS) happens unconditionally — before the precise check at line 91:
```cpp
void visit(BinaryOpStmt *stmt) override {
  auto const_lhs = stmt->lhs->cast<ConstStmt>();
  if (const_lhs && is_commutative(stmt->op_type) && !stmt->rhs->is<ConstStmt>()) {
    stmt->lhs = stmt->rhs;
    stmt->rhs = const_lhs;
    operand_swapped = true;
  }
  if ((!fast_math || stmt->precise) && !is_integral(stmt->ret_type)) {
    return;
  }
```

For FP addition and multiplication, commutativity is IEEE-guaranteed (a + b == b + a bitwise), so this swap is technically correct. However, for a feature whose explicit purpose is "source-order evaluation," silently reordering operands is semantically surprising. If a future version of the precise contract promises source-order operand evaluation (as MSL/HLSL precise does), this would become a bug.
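If source-order operand evaluation ever becomes part of the contract, the swap could be gated the same way as the fold below it; a sketch using only the fields from the quoted snippet:

```cpp
// Sketch: skip the const-to-RHS swap for precise FP ops.
const bool may_reorder = !(stmt->precise && !is_integral(stmt->ret_type));
if (may_reorder && const_lhs && is_commutative(stmt->op_type) &&
    !stmt->rhs->is<ConstStmt>()) {
  stmt->lhs = stmt->rhs;
  stmt->rhs = const_lhs;
  operand_swapped = true;
}
```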
4. In-place mutation is not thread-safe and causes aliasing surprises
The precise() walker in expr.cpp mutates Expression nodes in-place via shared_ptr. The doc acknowledges this:
"If you alias a subexpression and then wrap one alias in
qd.precise, both uses get IEEE semantics."
This is documented, but it's still surprising behavior that could cause subtle bugs. Consider:
```python
ab = a + b
x = qd.precise(ab)  # tags the BinaryOpExpression in-place
y = ab * 2          # ab is now also precise, not what the user intended
```

The user sees two independent-looking variables but they share an expression node. The in-place mutation means qd.precise has a non-local effect on every alias of every subexpression in the tree. This is a correctness hazard in any code that builds expression trees with shared subexpressions.
5. TernaryOpExpression is traversed but never tagged precise
The walker descends through TernaryOpExpression to reach inner ops, but does not set a precise flag on the ternary itself. The IR-level TernaryOpStmt has no precise field. This is correct for select (which lowers to a non-FP conditional move), but if a future ternary op has FP semantics (e.g. fma as a ternary), it would silently lack the precise tag.
6. alg_simp.cpp: 1 * a -> a fold is not gated on precise
The optimize_multiplication function folds 1 * a -> a (line 342-346) unconditionally:
```cpp
if (alg_is_one(lhs) || alg_is_one(rhs)) {
  // 1 * a -> a, a * 1 -> a
  stmt->replace_usages_with(alg_is_one(lhs) ? stmt->rhs : stmt->lhs);
  modifier.erase(stmt);
  return true;
}
```

This fold is IEEE-correct (1.0 * x == x for all finite x and for all NaN/Inf, under IEEE 754). So it's not wrong. But the precise tag's purpose is to prevent any rewriting of FP ops, and this fold does remove an instruction that the user explicitly marked. The analog in HLSL's precise would preserve this multiplication. This is a philosophical deviation from the stated "source-order, no simplification" contract.
Similarly, a / 1 -> a (line 395-400) is also unconditional and IEEE-correct but philosophically inconsistent.
7. constant_fold.cpp and simplify.cpp don't check precise
constant_fold.cpp folds BinaryOpStmt and UnaryOpStmt when both/all operands are constants. This can fold a precise-tagged operation into a constant, effectively removing it from the instruction stream before codegen ever sees the precise flag. For the constant-fold case specifically, the result is numerically identical (constant folding uses the host's IEEE arithmetic), so this is safe in practice. But it's another case where the precise tag is silently discarded.
simplify.cpp synthesizes BinaryOpStmt nodes (e.g. for LinearizeStmt) that inherit precise=false by default. This is fine since those synthetic ops are on integer index calculations, not user FP expressions.
8. Offline cache key serialization: TernaryOpExpression not serialized
gen_offline_cache_key.cpp serializes precise for UnaryOpExpression and BinaryOpExpression, but TernaryOpExpression has no precise field to serialize. This is currently consistent (ternary doesn't carry precise), but if the ternary ever gains a precise flag (see item 5), the cache key would need updating.
Summary of real bugs vs. latent risks
| Issue | Severity | Real bug today? |
|---|---|---|
| AMDGPU: no `disable_fast_math` for pow/atan2 | Medium | No (ocml is already correctly-rounded), but the doc overclaims |
| SPIRV inv: `div()` without precise | Low | No (masked by post-hoc decoration) |
| Commutative swap before precise check | Low | No (IEEE-correct), but violates spirit of precise |
| In-place aliasing mutation | Medium | Yes: surprising side-effects on shared subexpressions |
| `1 * a -> a` fold not gated on precise | Low | No (IEEE-correct), but violates stated "no simplification" contract |
| Constant fold ignores precise | Low | No (numerically identical result) |
Test coverage analysis
The PR adds four tests in test_precise.py plus one entry in test_api.py. Below is a per-item assessment of what is and is not exercised.
What IS covered
| # | Code change | How it is tested |
|---|---|---|
| 1 | Offline cache key (gen_offline_cache_key.cpp) | Indirectly by `test_qd_precise_protects_fast_math`. The naive kernel compiles first, populating the cache; if the cache key did not include the `precise` flag, the precise kernel would reuse the naive artifact and the final assertion (`ds_err < naive_err * 1e-3`) would fail. The test comment on lines 91-95 explicitly calls this out. Clever. |
| 3 | LLVM `disable_fast_math()` | Exercised end-to-end by both `test_qd_precise_protects_fast_math` (binary ops) and `test_qd_precise_unary_rounding` (unary ops) on any LLVM-based backend (CPU, CUDA, AMDGPU). |
| 6a | alg_simp: `a + 0 → a` / `0 + a → a` precise gate | Directly tested by `test_qd_precise_recurses_through_select`, which uses the `(-0.0) + 0.0` signed-zero observable to distinguish the folded and non-folded cases. |
| 6b | alg_simp: `a - a → 0` precise gate | Indirectly tested by `test_qd_precise_protects_fast_math`. The Dekker 2Sum compensation `(a - aa) + (b - bb)` is algebraically zero; `a - a → 0` is the dominant fold that kills it. The assertion that `lo ≠ 0` validates this gate. |
| 8 | `precise` flag on UnaryOpStmt / BinaryOpStmt (statements.h QD_STMT_DEF_FIELDS) | Propagation from Expression → Stmt is exercised by every end-to-end test. The QD_STMT_DEF_FIELDS inclusion also affects IR hashing/comparison, which is indirectly validated by the cache-key test above. |
| — | TernaryOp walker recursion | `test_qd_precise_recurses_through_select` directly tests that `qd.precise(qd.select(...))` recurses through the TernaryOp and tags inner binary ops. |
| — | ValueError on Quadrants classes | `test_qd_precise_rejects_quadrants_classes` covers Vector and Matrix. |
| — | Unary precise (sin/cos/log/sqrt) | `test_qd_precise_unary_rounding` verifies ≤2 ULP against a correctly-rounded f32 reference. |
| — | Public API surface | test_api.py verifies precise is in `qd.__all__`. |
What is NOT covered
| # | Code change | Gap |
|---|---|---|
| 2 | CUDA codegen `use_fast` variable (fast libdevice selection for sin/cos/log) | Only exercised when the test runner happens to use a CUDA backend. There is no `@pytest.mark.parametrize` or arch filter ensuring CUDA-specific coverage. On CPU-only CI, this code path is dead. |
| 4 | SPIR-V NoContraction decoration (spirv_codegen.cpp) | Same issue: only exercised when the active backend is Vulkan or Metal. No SPIR-V-specific test. No unit test that inspects the generated SPIR-V IR for the decoration. |
| 5 | SPIR-V IR builder `precise` parameter on add/sub/mul/div/mod | Same as #4: backend-dependent, no unit-level test of the IR builder itself. |
| 6c | alg_simp: `0 * a → 0` precise gate | Not tested. A test could do `qd.precise(x * zero)` where x is NaN; under IEEE `NaN * 0 = NaN`, but the fast-math fold would yield 0. |
| 6d | alg_simp: `a / a → 1` precise gate | Not tested. A test could do `qd.precise(x / x)` where `x = 0.0`; under IEEE `0/0 = NaN`, fast-math folds to 1. |
| 6e | alg_simp: `a / const → a * (1/const)` precise gate | Not tested. Could be observed with a carefully chosen constant where `a * (1/c)` and `a / c` differ at the ULP level. |
| 6f | alg_simp: `2 * a → a + a` precise propagation | The transformation itself is exercised (it fires for both precise and non-precise), but the test never verifies that the synthesized add carries `sum->precise = stmt->precise`. A regression here would silently strip the tag on the generated add. |
| 6g | alg_simp: comparison `a == a → 1` precise gate | Not tested. With NaN inputs, `NaN == NaN` should be false under IEEE but the fast-math fold makes it true. |
| 7 | binary_op_simplify.cpp changes | Neither the `try_rearranging_const_rhs` precise bail nor the `(!fast_math \|\| stmt->precise)` early return is directly tested. The 2Sum test may exercise the early return indirectly (the precise ops have const-zero operands that would match rearrangement patterns), but there is no targeted assertion. |
| 9 | `fast_math=False` + `qd.precise` (redundant but should work) | Not tested. All kernel tests use `fast_math=True`. A simple test with `fast_math=False` plus `qd.precise` wrapping would verify the redundant path doesn't crash or regress. |
| 10 | Different data types (f64, f16) | Every test uses `qd.f32` exclusively. f64 is important because the CUDA codegen for sin/cos/log on f64 takes a different branch (`__nv_sin` vs `__nv_sinf`). f16 is relevant because BinaryOpStmt::visit has special f16 widening/narrowing around the codegen that interacts with the disable_fast_math call placement. |
| 11 | Nested `qd.precise` calls | `qd.precise(qd.precise(expr))` is never tested. The walker is idempotent so this should be harmless, but it's an untested edge case. |
| 12 | `qd.precise` on integer expressions | Not tested. The code paths in alg_simp.cpp carefully check `is_integral(...)` to preserve integer folds even when precise is set. A test wrapping integer arithmetic in `qd.precise` would validate that integer simplifications are not accidentally suppressed. |
| 13 | pow precise-bail | Deliberately omitted with a well-reasoned comment (lines 223-227 of the test file) explaining that the rewrite targets are IEEE-equivalent and no runtime observable difference exists today. Acceptable as-is, but the guard is purely future-proofing and could silently break without notice. |
| 14 | Walker stopping at loads, constants, func calls | Partially covered: the 2Sum test uses loads and `qd.func` calls as boundaries, so if the walker incorrectly descended into them it would hit a crash or type error. But there is no explicit test that e.g. a `qd.func` call site's inner body remains non-precise when only the call site is wrapped. The doc says "semantics inside a qd.func body are governed by that body's own ops"; this contract is not tested. Atomic ops as a walker boundary are also untested. |
Summary
The tests are well-designed for the happy path: the Dekker 2Sum test is the right canonical workload and the signed-zero select test is clever. However, coverage is narrow along two axes:
- Breadth of simplification rules: Only 2 of ~7 modified alg_simp rules are observably tested (`a + 0` and `a - a`). The rest (`0 * a`, `a / a`, `a / const`, `2*a` tag propagation, comparisons) have no targeted tests. Several of these have clean IEEE-observable differences (NaN propagation for `0 * NaN`, `0/0`, `NaN == NaN`) that would make good test cases.
- Backend-specific codegen: CUDA libdevice selection and the SPIR-V NoContraction decoration are only tested when those backends happen to be active. On CPU-only CI, those code paths get zero coverage. A SPIR-V IR unit test that checks for the decoration would be especially valuable since NoContraction is the only mechanism protecting compensated arithmetic on Metal/Vulkan.
- Type coverage: f64 and f16 are entirely untested, yet both have distinct codegen branches that interact with the precise flag.
The binary_op_simplify.cpp changes (both the try_rearranging_const_rhs precise bail and the early-return gate) have no targeted tests. These are the most likely to regress silently since the bail conditions are subtle (they depend on the inner LHS being a precise BinaryOpStmt with a const RHS).
Test factorization analysis
The test file has four tests in ~227 lines. The structure is reasonable for a new feature, but there are concrete opportunities to reduce duplication without sacrificing clarity.
1. Unary rounding test: parametrize over functions
test_qd_precise_unary_rounding tests sin, cos, log, and sqrt in a single kernel, computing all four in lockstep and comparing against a stacked numpy reference. This "batch" design means a failure in any one function fails the whole test with a single max-ULP number, making triage harder.
Suggestion: Use @pytest.mark.parametrize to run each function independently. The repo already does this in test_scalar_op.py (lines 51–65) where binary_func_table / unary_func_table are parametrized. Here the kernel would take the op as a qd.static(...) parameter:
```python
@pytest.mark.parametrize("op_name", ["sin", "cos", "log", "sqrt"])
@test_utils.test(default_fp=qd.f32, fast_math=True)
def test_qd_precise_unary_rounding(op_name):
    qd_op = getattr(qd, op_name)
    np_op = getattr(np, op_name)

    @qd.kernel
    def k(x: qd.types.ndarray(qd.f32, ndim=1),
          out: qd.types.ndarray(qd.f32, ndim=1)):
        for i in range(x.shape[0]):
            out[i] = qd.precise(qd_op(x[i]))

    xs = np.array([0.5, 1.5, 2.5, 4.0, 7.0, 10.0, 25.0, 50.0], dtype=np.float32)
    in_arr = qd.ndarray(dtype=qd.f32, shape=(len(xs),))
    in_arr.from_numpy(xs)
    out = qd.ndarray(dtype=qd.f32, shape=(len(xs),))
    k(in_arr, out)
    res = out.to_numpy()
    ref = np_op(xs.astype(np.float64)).astype(np.float32)
    ulp = np.spacing(np.maximum(np.abs(ref), np.float32(1.0)))
    max_ulp = float(np.max(np.abs(res - ref) / ulp))
    assert max_ulp <= 2.0, f"qd.precise(qd.{op_name}(x)) off by {max_ulp:.2f} ULP"
```

This gives per-function pass/fail reporting and is trivially extensible (add "exp", "tan", etc.).
2. Unify two_sum_naive/two_sum_precise and fast_two_sum_naive/fast_two_sum_precise with qd.static()
The main test defines four @qd.func helpers — two_sum_naive, two_sum_precise, fast_two_sum_naive, fast_two_sum_precise — that are structurally identical except for whether each op is wrapped in qd.precise(). The repo already uses if qd.static(...) inside kernels/funcs for exactly this kind of compile-time branching (see test_static.py, test_cache.py, bls_test_template.py).
Suggestion: Collapse to two funcs using a use_precise: qd.template() parameter:
```python
@qd.func
def two_sum(a, b, use_precise: qd.template()):
    if qd.static(use_precise):
        s = qd.precise(a + b)
        bb = qd.precise(s - a)
        aa = qd.precise(s - bb)
        e = qd.precise((a - aa) + (b - bb))
    else:
        s = a + b
        bb = s - a
        aa = s - bb
        e = (a - aa) + (b - bb)
    return s, e


@qd.func
def fast_two_sum(a, b, use_precise: qd.template()):
    if qd.static(use_precise):
        s = qd.precise(a + b)
        e = qd.precise(b - (s - a))
    else:
        s = a + b
        e = b - (s - a)
    return s, e
```

This removes two of the four helper functions and makes the naive-vs-precise symmetry explicit instead of implicit. The two kernel functions (df_accum_naive, df_accum_precise) would similarly collapse into one kernel that takes use_precise: qd.template().
However, there is a counter-argument: the qd.static version still duplicates the expression bodies inside the if/else branches, so the line savings are modest. The current form arguably makes the "what does precise protect?" story easier to read side-by-side. This is a judgment call — I'd suggest unifying only if the test file grows more variants (e.g. two_sum_f64, different accumulation orders), where the current copy-paste approach would scale badly.
3. The two kernels df_accum_naive/df_accum_precise could be unified
Same pattern as above. These are identical except for calling the _naive vs _precise helpers and wrapping e + lo in qd.precise. A single kernel parametrized by use_precise: qd.template() would cut ~20 lines:
@qd.kernel
def df_accum(in_arr: qd.types.ndarray(qd.f32, ndim=1),
             out: qd.types.ndarray(qd.f32, ndim=1),
             use_precise: qd.template()):
    for _ in range(1):
        hi = qd.f32(1.0)
        lo = qd.f32(0.0)
        for i in range(N):
            s, e = two_sum(hi, in_arr[i], use_precise)
            if qd.static(use_precise):
                e = qd.precise(e + lo)
            else:
                e = e + lo
            hi, lo = fast_two_sum(s, e, use_precise)
        out[0] = hi
        out[1] = lo

df_accum(in_arr, out_naive, False)
df_accum(in_arr, out_precise, True)

4. Extract shared ndarray setup
The pattern of creating input/output ndarrays, calling a kernel, and reading back numpy results appears in every test. A small helper would reduce boilerplate:
def _make_f32(shape, fill=None, data=None):
    arr = qd.ndarray(dtype=qd.f32, shape=shape)
    if data is not None:
        arr.from_numpy(data)
    elif fill is not None:
        arr.from_numpy(np.full(shape, fill, dtype=np.float32))
    return arr

This is minor (saves ~3 lines per test), but every other test file in the repo seems to repeat the same pattern, so it's not really a divergence from project style. Not a strong recommendation.
5. Parametrize naive vs. precise comparison in the main test? Not recommended.
One might be tempted to @pytest.mark.parametrize("use_precise", [True, False]) on the main 2Sum test and assert different things per variant. But the test's point is to run both in the same qd.init session so the offline-cache-key distinction is validated (running naive first populates the cache; running precise second must not reuse that artifact). Splitting into two parametrized invocations would lose that coverage because @test_utils.test calls qd.init separately per invocation. Keep this as a single test.
6. Summary of concrete recommendations
| Change | Lines saved (approx.) | Readability impact | Recommended? |
|---|---|---|---|
| Parametrize unary test over sin/cos/log/sqrt | ~15 | Better: per-op pass/fail | Yes |
| Unify two_sum_naive/two_sum_precise via qd.static | ~15 | Neutral to slightly worse | Soft yes — do it if more variants are coming |
| Unify df_accum_naive/df_accum_precise via qd.static | ~20 | Neutral | Soft yes, same caveat |
| Extract ndarray helper | ~10 | Marginal | No — not worth a project-wide pattern break |
| Parametrize naive-vs-precise as two test cases | 0 | Worse: breaks cache-key coverage | No |
The strongest win is (1): parametrize the unary test. It costs nothing, improves diagnostics, and matches existing repo conventions (test_scalar_op.py). Items (2)–(3) are nice-to-haves if the file grows; the current duplication is tolerable at this size.
Let's also address:
a ** 0.5 → sqrt(a) (line 186): creates UnaryOpStmt(sqrt) without copying precise.
For this one, not necessarily requesting it to be implemented, but I would at least like you to acknowledge and respond to it please:
As written, qd.precise(x ** 4) emits a pow(x, 4) libcall/intrinsic at runtime instead of three multiplies — that is an order of magnitude slower on every backend (GPU pow is ~50 cycles vs. ~12 cycles for three FMULs). Users of compensated arithmetic do use small-integer powers, so this is likely to bite. Recommendation: propagate precise onto the synthesized stmts in exponent_n_optimize, exponent_half_optimize, and exponent_negative_optimize (as is already done for the alg_is_two → add rewrite), and remove the blanket bail-out.
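To make the concern concrete, a hedged user-level sketch (assuming an import quadrants as qd alias; the kernels and names are illustrative, and the manual expansion is not bit-identical to a correctly rounded pow, since each multiply rounds separately):

import quadrants as qd

@qd.kernel
def pow4(x: qd.types.ndarray(qd.f32, ndim=1), out: qd.types.ndarray(qd.f32, ndim=1)):
    for i in range(x.shape[0]):
        # With the blanket bail-out described above, this stays a pow(x, 4.0) libcall at runtime.
        out[i] = qd.precise(x[i] ** 4)

@qd.kernel
def pow4_manual(x: qd.types.ndarray(qd.f32, ndim=1), out: qd.types.ndarray(qd.f32, ndim=1)):
    for i in range(x.shape[0]):
        # What a compensated-arithmetic author would write by hand instead:
        # two IEEE-rounded multiplies, cheap on every backend.
        x2 = qd.precise(x[i] * x[i])
        out[i] = qd.precise(x2 * x2)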
Not sure if the following is still true in LLVM 22, or whether it affects code correctness, but I would at least like you to acknowledge and respond to it: Recommendation: verify which LLVM version the project targets and switch to copyFastMathFlags(FastMathFlags{}) if available.

Not requesting the following, but it seems like the kind of thing you might find interesting?
This runs at kernel-definition time (Python → AST), not at GPU runtime, and expression trees are typically small (a handful of nodes per qd.precise() call), so the absolute cost is negligible. But if this ever needs to scale (e.g. wrapping large generated expression trees): use raw Expression* in the worklist instead of Expr to avoid atomic ref-count churn. This sounds worth doing?
Recommendation: Move the body to the header as an inline method, or mark it __attribute__((always_inline)). This seems worth checking:
The doc table (line 104) claims AMDGPU has precise support ("LLVM FMF cleared") but no code changes were made to codegen_amdgpu.cpp. The base-class TaskCodeGenLLVM handles normal arithmetic, but emit_extra_unary is completely overridden in AMDGPU and never checks stmt->precise — the __ocml_* functions are always called unconditionally. Unlike CUDA (which has __nv_fast_* variants), AMDGPU's __ocml_* functions are already correctly rounded, so this is not a behavioral bug today. But if the compiler ever adds __ocml_fast_* variants, this code path has no precise guard to prevent selecting them.
This seems important to me, though it might need revisiting the architecture:
"If you alias a subexpression and then wrap one alias in qd.precise, both uses get IEEE semantics." This is documented, but it's still surprising behavior that could cause subtle bugs. Consider: ab = a + b
Opus assessment on what has changed relative to my requests:

ADDRESSED (done in the latest update):
────────────────────────────────────────
PARTIALLY ADDRESSED:
────────────────────────────────────────
NOT ADDRESSED (acknowledged or deferred):
Looks like a bunch of things have not been addressed yet? Of these, I think the most critical is the current design decision to modify in-place, giving unintuitive aliasing effects. Would also like the tests to be factorized to use parametrize.
LGTM — the inline nit about SPIR-V NoContraction on atan2/pow is the only remaining item; as the PR itself documents, NoContraction scopes to arithmetic instructions per spec, so its absence on OpExtInst transcendentals is a known best-effort limitation rather than a regression.
Extended reasoning...
Overview
PR adds qd.precise(expr) across 23 files: Python API, frontend IR, backend IR (both Expression and Stmt layers), four codegen backends (LLVM/CUDA/AMDGPU/SPIR-V), three transform passes (alg_simp, binary_op_simplify, demote_operations), offline cache key, and tests + docs.
Security risks
None. This is a compile-time IR transformation that adds a boolean flag and gates fast-math rewrites on it. No auth, crypto, permissions, or external I/O involved.
Level of scrutiny
This PR warranted — and received — high scrutiny. It touches critical FP codegen paths on every backend. Through multiple review rounds, nine distinct correctness bugs were identified and fixed (unary pipeline, 2×a strength reduction, a-0 gate, pow/exponent synthesized-stmt tagging, demote_operations pow and ffloor, f16 rsqrt intermediate, f16 atan2/pow, AMDGPU override, SPIR-V inv). All were addressed in the current diff. The inline nit (SPIR-V NoContraction not emitted for atan2/pow via FLOAT_BINARY_OP_TO_SPIRV_FLOAT_FUNC) has zero functional impact — the PR description and code comments explicitly acknowledge that NoContraction scopes to arithmetic instructions and is best-effort on OpExtInst transcendentals.
Other factors
- Seven distinct test functions covering the primary use case, unary rounding, walker boundaries (select, bit_cast, qd.func call sites, aliasing), idempotency under fast_math=False, and the offline-cache-key distinction.
- User-facing documentation (precise.md) linked from the user guide, with an honest backend coverage table.
- The PR author took full personal responsibility for the code at 16:26 and has been extremely responsive throughout the review cycle.
7fc4b56 to
cafb630
Compare
This feature is fully ready but no longer actively required in Genesis. Closing for now, as requested by @hughperkins, the current official Quadrants maintainer, in accordance with the current merging policy.
Please could you consider reducing the 'points of contact' between the experimental code in this branch and the commonly used code paths?
Two correctness bugs found; deferring for author fix before merge.
Extended reasoning...
Overview
PR #476 adds qd.precise(...) for per-op IEEE-strict FP across 26 files, touching IR, codegen (LLVM, CUDA, AMDGPU, SPIR-V), algebraic simplification passes, and tests. The scope is non-trivial.
Bugs Found
Two correctness issues were identified by the automated review system and are flagged as inline comments on the relevant lines:
- exponent_half_optimize violates the precise contract for a = -0.0 (quadrants/transforms/alg_simp.cpp): the rewrite a ** 0.5 → sqrt(a) is applied unconditionally even when stmt->precise is set. While the PR correctly propagates result->precise = stmt->precise to clear FMF on the synthesized sqrt, it does not prevent the semantic change: per IEEE 754 §9.2, pow(-0.0, 0.5) = +0.0, but per §5.4.1, sqrt(-0.0) = -0.0. Code calling qd.precise(x ** 0.5) with x = -0.0 will silently get the wrong sign bit. Fix: add if (stmt->precise) return false; at the top of exponent_half_optimize().
- Three signed-zero tests missing arch restriction (tests/python/test_precise.py): test_qd_precise_recurses_through_select, test_qd_precise_recurses_through_bit_cast, and test_qd_precise_stops_at_qd_func_call assert precise_bits == 0x00000000 (i.e. (-0.0f) + 0.0f = +0.0f), which requires IEEE signed-zero preservation. SPIR-V's NoContraction decoration only prevents FMA contraction; without emitting the SignedZeroInfNanPreserve execution mode, conformant Vulkan/Metal drivers may return -0.0, causing spurious CI failures. The fix is to add arch=[qd.cpu, qd.cuda, qd.amdgpu] to all three test decorators, matching the pattern already used elsewhere in the file.
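For concreteness, the suggested gate would look roughly like this (the other decorator arguments are assumed from the existing tests in this file):

@test_utils.test(default_fp=qd.f32, fast_math=True, arch=[qd.cpu, qd.cuda, qd.amdgpu])
def test_qd_precise_recurses_through_select():
    ...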
Security Risks
None. This is a floating-point precision primitive; no auth, crypto, or untrusted-input handling is involved.
Level of Scrutiny
The changes are substantial (26 files, new IR field, codegen changes across all backends) but well-structured. The two bugs are confined to a single optimization gate and three test decorators — both straightforward to fix. The rest of the PR looks mechanically correct given the design. Human review is still warranted for the overall approach, but these two issues should be addressed first.
Other Factors
No prior review activity on this PR. The PR has thorough inline documentation and tests covering the happy path. The missing guard in exponent_half_optimize is particularly subtle because the comment even claims the rewrite is "IEEE-equivalent" — which is true for all inputs except -0.0.
  }
  }
  auto a = stmt->lhs;
- cast_to_result_type(a, stmt);
- auto result = Stmt::make<UnaryOpStmt>(UnaryOpType::sqrt, a);
+ cast_to_result_type(a, stmt, stmt->precise);
+ auto result = Stmt::make_typed<UnaryOpStmt>(UnaryOpType::sqrt, a);
  result->ret_type = a->ret_type;
+ // `a ** 0.5 -> sqrt(a)` is IEEE-equivalent, but the synthesized sqrt must carry `precise` so codegen clears FMF on
+ // it; otherwise `qd.precise(x ** 0.5)` silently gets `afn`-approximated.
+ result->precise = stmt->precise;
  stmt->replace_usages_with(result.get());
  modifier.insert_before(stmt, std::move(result));
  modifier.erase(stmt);
🔴 exponent_half_optimize() applies a**0.5 → sqrt(a) unconditionally even when stmt->precise is set, violating the qd.precise contract: pow(-0.0, 0.5) = +0.0 per IEEE 754, but sqrt(-0.0) = -0.0, so the substitution silently changes the sign bit. Add if (stmt->precise) return false; at the top of exponent_half_optimize() to skip the rewrite under precise mode.
Extended reasoning...
Bug summary
exponent_half_optimize() in alg_simp.cpp rewrites a**0.5 to sqrt(a) without consulting stmt->precise. The qd.precise contract promises that every algebraic identity rewrite is suppressed under precise mode so programs using qd.precise get bitwise-identical results to the unoptimized IEEE evaluation. This substitution breaks that promise for the specific input a = -0.0.
IEEE 754 signed-zero discrepancy
IEEE 754-2008 §9.2 (pow) defines pow(-0.0, 0.5) = +0.0 because pow treats -0.0 as a magnitude-zero positive number for half-integer exponents with even denominator. IEEE 754-2008 §5.4.1 (squareRoot) defines squareRoot(-0.0) = -0.0 because sqrt propagates the sign bit. These are both correct IEEE results for their respective operations, but they disagree on the sign bit. The rewrite therefore changes observable behavior for any program that relies on the sign of pow(-0.0, 0.5).
Why result->precise = stmt->precise does not fix this
The PR correctly propagates the precise flag to the synthesized UnaryOpStmt so codegen calls disable_fast_math() on the sqrt instruction, preventing the LLVM afn (approximate functions) fast-math flag from being applied. This addresses the codegen-level FMF concern but is orthogonal to the semantic issue: even with every fast-math flag cleared, sqrt(-0.0) hardware instruction still returns -0.0 on all IEEE-compliant targets (x86, RISC-V, SPIR-V, PTX). Clearing FMF cannot change the fact that the substituted operation computes a semantically different function on -0.0.
Concrete proof of the violation
Consider the following sequence under qd.precise mode:
- User writes: qd.precise(a ** 0.5) where a = -0.0 (float32 bit pattern 0x80000000)
- exponent_half_optimize fires because get_exponent_values() returns {0.5} and the 0.5 check passes
- The BinaryOpStmt pow is replaced with a UnaryOpStmt sqrt carrying precise=true
- Codegen emits: sqrt(-0.0) with FMF cleared → result = -0.0 (0x80000000)
- Expected per IEEE 754 pow semantics: pow(-0.0, 0.5) = +0.0 (0x00000000)
The user observes +0.0 becoming -0.0 — a sign-bit flip that strict IEEE arithmetic must not introduce.
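The sign-bit discrepancy is easy to reproduce outside the compiler; a quick numpy check (numpy defers to the platform's IEEE-conformant libm here):

import numpy as np

neg_zero = np.float32(-0.0)
print(np.signbit(np.sqrt(neg_zero)))                    # True:  sqrt(-0.0) = -0.0
print(np.signbit(np.power(neg_zero, np.float32(0.5))))  # False: pow(-0.0, 0.5) = +0.0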
False claim in source comment and test
alg_simp.cpp line 191 currently states "a ** 0.5 -> sqrt(a) is IEEE-equivalent". This is factually incorrect for a = -0.0. test_precise.py lines 479–483 contain a NOTE repeating the same claim and citing it as justification for omitting a behavioral test for this rewrite; that NOTE should be removed or corrected once the guard is added.
Fix
Add a guard at the start of exponent_half_optimize():
0 ""
0 ""
0 ""
1 "/usr/include/stdc-predef.h" 1 3 4
0 "" 2
1 ""
This is the minimal, conservative fix: precise programs skip the rewrite entirely and the codegen path for pow with precise=true already handles FMF clearing. The test NOTE in test_precise.py should also be updated to reflect that the a**0.5 rewrite is intentionally suppressed under precise mode.
Chatted with Opus for a while about reducing the point of contact. It came up with: Thoughts?
created a branch to try to reduce contact points https://github.com/Genesis-Embodied-AI/quadrants/compare/experimental...hp/precise-pt_contact?expand=1 contact points still seem too broad to me. might prod opus a bit more to shrink these, at some point (eg cuda codegen has precise-specific stuff added, for example) (Note to self: implemented by deskai4; 8h30 ET apr 16 2026)
qd.precise(...): per-op IEEE-strict FP for compensated arithmetic

TL;DR
Mirrors the precise keyword in MSL/HLSL. Compensated-arithmetic kernels (Dekker / Kahan 2Sum, Veltkamp split, double-single accumulators) now have a clean way to opt out of fast-math at the granularity of a single expression instead of having to flip the global fast_math=False switch and pay the perf cost everywhere else.

Why
Quadrants compiles kernels with fast_math=True by default. Under that mode every backend's compiler is free to:

- reassociate FP arithmetic ((a + b) + c → a + (b + c))
- substitute approximations for sqrt, sin, cos, log, 1/x (varies by backend)
- apply algebraic identities (0 * a → 0, a / a → 1, a - a → 0, a == a → true)

This silently destroys compensated-arithmetic primitives whose entire correctness is built on the fact that (a - aa) + (b - bb) is non-zero in IEEE arithmetic. The traditional workaround — globally flipping fast_math=False — is too coarse: it costs perf across every kernel even when only a handful of lines need IEEE semantics. qd.precise(expr) is the per-op opt-in.
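A minimal numpy illustration of the invariant at stake (numpy's f32 arithmetic is IEEE-strict, so it stands in for what the unoptimized kernel computes; names follow the 2Sum convention):

import numpy as np

a, b = np.float32(1.0), np.float32(1e-8)
s = a + b                  # rounds to exactly 1.0 in f32; b is absorbed
bb = s - a                 # 0.0
aa = s - bb                # 1.0
e = (a - aa) + (b - bb)    # recovers the lost 1e-8: exactly the term that an
                           # algebraic "a - a = 0" style fold collapses to zero
print(s, e)                # 1.0 and ~1e-08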
Recursively tags every
BinaryOpExpressionandUnaryOpExpressioninobj's subtree with apreciseflag. Recursion descends through:BinaryOpExpression(lhs, rhs)— tags itself, recurses into both operands.UnaryOpExpression(operand)— tags itself, recurses into the operand (coverscast,bit_cast,neg,sqrt,sin,cos,log,exp,rsqrt,inv, ...).TernaryOpExpression(op1, op2, op3)— recurses into all three (no precise field on it; just keeps inner ops reachable throughqd.select(...)wrappers).Stops at any other Expression kind: loads, constants, ndarray accesses,
qd.funccalls, etc. Semantics inside aqd.funcbody are governed by that body's own ops — wrap calls separately if needed.Tagging is in-place on the underlying shared
Expressionnodes, so if you alias a subexpression and then wrap one alias, both uses get IEEE semantics (intentional — the tag is a property of the value, not the use site).The Python entry point is in
python/quadrants/lang/ops.py; bound to C++ via_qd_core.preciseinquadrants/python/export_lang.cpp; backed byExpr precise(const Expr&)inquadrants/ir/expr.cpp.Mechanism end-to-end
1. AST + IR — one bool field, two layers
quadrants/ir/frontend_ir.hBinaryOpExpression::precise,UnaryOpExpression::precisequadrants/ir/statements.hBinaryOpStmt::precise,UnaryOpStmt::preciseBoth IR fields are added to
QD_STMT_DEF_FIELDS(...)so:clone()(via copy-ctor) carries them automatically.same_statementsIR-equality picks them up viafield_manager.equal().WholeKernelCSEtherefore won't merge a precise op with a non-precise twin.Propagation from AST to IR happens in
BinaryOpExpression::flatten()andUnaryOpExpression::flatten()(quadrants/ir/frontend_ir.cpp) — setstmt->precise = expr->preciseon the freshly-builtBinaryOpStmt/UnaryOpStmtbeforectx->push_back.2. Walker —
precise()free functionquadrants/ir/expr.cpp::precise(const Expr&). A worklist walk (not recursive, to keep stack depth bounded for deep AST chains common in scientific code) that descends through Binary/Unary/Ternary nodes and setsprecise = trueon Binary/Unary along the way.3. Transforms — gate FP folds on
!stmt->precisequadrants/transforms/alg_simp.cpp0 * a → 0,a / a → 1,a / const → a * (1/const),a - a → 0,a -^ a → 0,a == a → 1,a != a → 0quadrants/transforms/binary_op_simplify.cpptry_rearranging_const_rhsand the rest of the FP simplification bodyEach previously-fast-math-gated branch now also checks
3. Transforms — gate FP folds on !stmt->precise

- quadrants/transforms/alg_simp.cpp: 0 * a → 0, a / a → 1, a / const → a * (1/const), a - a → 0, a -^ a → 0, a == a → 1, a != a → 0
- quadrants/transforms/binary_op_simplify.cpp: try_rearranging_const_rhs and the rest of the FP simplification body

Each previously-fast-math-gated branch now also checks && !stmt->precise. The integer paths are untouched (precise has no meaning for integer ops; the field is ignored for them everywhere).

4. Codegen — three backends, two mechanisms
At the bottom of
TaskCodeGenLLVM::visit(BinaryOpStmt*)andvisit(UnaryOpStmt*)(quadrants/codegen/llvm/codegen_llvm.cpp):The IRBuilder default-stamps fast-math flags on every FP instruction it creates (
afn nnan ninf nsz reassoc). Forpreciseops we zero them out individually after construction.This handles all FP intrinsics:
llvm.sqrt.f32, FAdd/FSub/FMul/FDiv, CreateMaxNum/MinNum, FCmp comparisons, FMA contractions.SPIR-V (Metal / Vulkan, including MoltenVK) — per-op
NoContractiondecorationquadrants/codegen/spirv/spirv_ir_builder.{h,cpp}— added abool precise = falseparameter toadd/sub/mul/div/modand amaybe_no_contraction(Value v, bool precise)helper that emitsOpDecorate ... NoContractionwhenpreciseis true.quadrants/codegen/spirv/spirv_codegen.cpp::visit(BinaryOpStmt*)andvisit(UnaryOpStmt*)threadbin->precise/stmt->preciseinto the appropriate call sites.SPIRV-Cross translates
NoContractioninto MSL'sprecisequalifier when it generates MSL — which is what MoltenVK uses internally on macOS. So this single decoration mechanism covers both native Vulkan drivers and the MoltenVK-on-Metal path.CUDA libdevice — explicit fast/precise selection
CUDA is the only backend with explicit fast-vs-precise function-call selection that bypasses LLVM FMF. In
quadrants/codegen/cuda/codegen_cuda.cpp,qd.sin/cos/logfor f32 are emitted as direct calls to either__nv_fast_sinf(~few ULP, degrades sharply outside (-π/2, π/2)) or__nv_sinf(correctly rounded). LLVM FMF on the call instruction is meaningless because it's a plain function call; the function name itself encodes the approximation level.Patched all three sites:
Other backends are unaffected by this class of issue:
qd.sin/cos/log/expto libc calls (sinf/...) which are correctly rounded by default;qd.sqrtlowers tollvm.sqrt.f32(handled by the FMF clear).__ocml_*(no__ocml_native_*fast variants in the codegen).OpExtInst GLSL.std.450 Sin/Cos/Log/.... TheNoContractiondecoration is emitted but per the SPIR-V spec it scopes to arithmetic instructions, so most consumers (including SPIRV-Cross's MSL path) will ignore it on ExtInst results. On those paths correctness relies on the driver's default (non-fast-math) stdlib, which is typically within 1-2 ULP for in-range inputs. The decoration is kept as best-effort future-proofing.5. Offline cache key — content-addressed, must include the new field
quadrants/analysis/gen_offline_cache_key.cpp— the AST-level cache key generator emitsexpr->precisefor bothBinaryOpExpressionandUnaryOpExpression. Without this, two structurally-identical kernels (one withqd.precise(...)wrappers, one without) would hash to the same key, and the second to compile would silently reuse the first's compiled artifact. The merged binary test catches this regression directly.Per-backend coverage matrix
Fast transcendental (sin/cos/log) handling per backend:

- CPU (LLVM): libc sinf is correctly rounded — N/A
- CUDA: codegen_cuda.cpp libdevice selection ✓
- AMDGPU: __ocml_* already correctly rounded — N/A
- Vulkan (SPIR-V): NoContraction decoration ✓
- Metal (MoltenVK): NoContraction → MSL precise ✓

Tests — tests/python/test_precise.py

Two tests, both run on every available backend.
test_qd_precise_protects_fast_math — observable on all backends
df_accum_naive— unprotected; underfast_math=Truethe compensation termlocollapses to zero (this is the bugqd.preciseexists to fix; asserted as the negative control).df_accum_precise— every FP op wrapped inqd.precise(...); assertslois non-trivially non-zero AND thathi + lobeats the naïve f32 sum by >1000×.The same test also implicitly validates the offline-cache key fix: the two kernels are structurally identical apart from the
qd.precise(...)wrappers, so ifgen_offline_cache_key.cppdidn't account for theprecisefield, the second compile would silently reuse the first's artifact and the precise version would behave like the naive — comment in the test spells this out so future readers don't accidentally split the kernels back into separate tests and lose the coverage.test_qd_precise_unary_rounding— verifies the unary path on every backendComputes
qd.precise(qd.sin/cos/log(x))for inputs[0.5, 1.5, 2.5, 4.0, 7.0, 10.0, 25.0, 50.0](chosen to span both the central range AND values where some backends' fast-math approximations are known to degrade). Asserts the result is within 2 ULP of the correctly-rounded f32 reference computed via numpy in f64 then narrowed.The naive (non-precise) variant is deliberately not part of this test — on most backends
fast_math=Truealready produces correctly-rounded transcendentals, so a naive-vs-precise comparison would be uninformative. The test instead asks the right question directly: "isqd.precise(unary)correctly rounded?"Side-effect audit (what can't go sideways)
I went through every place in the codebase where the IR gets hashed, compared, serialized, or content-deduped. Findings:
gen_offline_cache_key.cpppython/quadrants/lang/_fast_caching/src_hasher.pyqd.precise(...)callsQD_STMT_DEF_FIELDS(..., precise)+ auto copy-ctorsame_statements)field_manager.equal()seespreciseviaQD_STMT_DEF_FIELDStransforms/whole_kernel_cse.cppfalls back tosame_statementsfor equalityprecise=false, gates everywheretransforms/ir_printer.cpp,expression_printer.hprecise— pure debug visibility, not load-bearing