[Lang] Add qd.precise(...) for per-op IEEE-strict FP. #476
Open
duburcqa wants to merge 41 commits into experimental from duburcqa/qd_precise
41 commits
ba88c69 (duburcqa) [Lang] Add qd.precise(...) for per-op IEEE-strict FP
d14f322 (duburcqa) [Lang] qd.precise: cover UnaryOpStmt as well
1898a31 (duburcqa) [Lang] qd.precise: address self-review feedback
450fb93 (duburcqa) [Lang] qd.precise: gate alg_simp folds, cover sqrt, DRY CUDA libdevice
fdeb1ea (duburcqa) [Lang] qd.precise: scrub non-ASCII from comments
6180f04 (duburcqa) [Lang] qd.precise: replace -- with single - in comments
9bb5342 (duburcqa) [Doc] User guide entry for qd.precise
8abb2b3 (duburcqa) [Lang] qd.precise: factor disable_fast_math helper, add Vector/select…
cc68a95 (duburcqa) [Lang] qd.precise: propagate tag in 2*a rewrite, narrow zero-fold gat…
c4a8dac (duburcqa) [Lang] qd.precise: use make_typed to avoid downcast on synthesized 2*…
29fb886 (duburcqa) Cleanup doc.
3841fea (duburcqa) [Lang] qd.precise: cover walker boundaries (qd.func, bit_cast, alias,…
21f20a9 (duburcqa) [Lang] qd.precise: fix docstring to mention unary FP ops and approxim…
b8ec4f8 (duburcqa) [Lang] qd.precise: unify precise field comments via canonical referen…
8601d6f (duburcqa) [Lang] qd.precise: propagate tag through synthesized stmts in alg_sim…
6f30d28 (duburcqa) [Lang] qd.precise: clear LLVM FMF on intermediate and pre-FPTrunc values
3af2f9f (duburcqa) [Lang] qd.precise: SPIR-V inv forwards precise, inline maybe_no_contr…
d4ffbe8 (duburcqa) [Lang] qd.precise: drop bit-ops-on-FP from doc; align __all__ positio…
bc3c358 (duburcqa) [Lang] qd.precise: clone input subtree instead of mutating in-place; …
8a58940 (duburcqa) [Lang] qd.precise: parametrize unary rounding test per op for per-op …
cf9023a (duburcqa) [Lang] qd.precise: SPIR-V visit(BinaryOpStmt) tags FP transcendental …
4259432 (duburcqa) [Lang] qd.precise: reflow PR-introduced C++ comments to 120 cols
0c47065 (duburcqa) [Lang] qd.precise: propagate tag through cast in 2*a rewrite (and ref…
41801a7 (duburcqa) [Lang] qd.precise: CUDA emit_extra_unary clears FMF on libdevice call…
e01778b (duburcqa) [Lang] qd.precise: skip sin/cos unary-rounding on SPIR-V, drop redund…
5a2dbb9 (duburcqa) [Lang] qd.precise: unary-rounding test restricts to LLVM via arch dec…
e3196b7 (duburcqa) [Lang] qd.precise: type_check propagates tag through implicit operand…
8e52ee1 (duburcqa) [Lang] qd.precise: document SPIR-V arithmetic/post-hoc two-layer deco…
4aa6c7f (duburcqa) [Lang] qd.precise: scalarize propagates tag onto per-element scalar B…
14fb6ca (duburcqa) [Lang] qd.precise: SPIR-V decorates FP ops once via post-hoc block; d…
7f34d62 (duburcqa) [Lang] qd.precise: idempotency test also covers AMDGPU (also an LLVM …
5676eb8 (duburcqa) [Lang] qd.precise: AMDGPU i32 pow clears FMF on __ocml_pow_f64 call b…
43c4367 (duburcqa) [Lang] qd.precise: exclude cmp_gt/cmp_lt from precise guard (IEEE-fal…
85fbb6c (duburcqa) [Lang] qd.precise: iterative worklist in clone_and_tag_precise (O(1) …
94fbfc5 (duburcqa) [Lang] qd.precise: precise_fp_add requires FP operand type; integer a…
b519f33 (duburcqa) [Lang] qd.precise: fix same_operation comment, document IdExpression …
0eb62de (duburcqa) [Lang] qd.precise: IR printer annotates [precise] on Unary/BinaryOpSt…
acdcfbd (duburcqa) [Lang] qd.precise: fix op count in precise.md example comment (three …
426198e (duburcqa) [Lang] qd.precise: add rsqrt to unary-rounding test; add floordiv con…
cafb630 (duburcqa) [Lang] qd.precise: fix fast_math=False table row; a+0 fold is precise…
6712b0c (hughperkins) Merge branch 'experimental' into duburcqa/qd_precise
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -19,6 +19,7 @@ scalar_tensors | |
| matrix_vector | ||
| compound_types | ||
| static | ||
| precise | ||
| sub_functions | ||
| parallelization | ||
| ``` | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,116 @@ | ||
| # qd.precise | ||
|
|
||
| `qd.precise(expr)` marks a floating-point expression as IEEE-strict. Every binary and unary FP op inside the wrapped subtree is evaluated in source order with no reassociation, no FMA contraction, and no non-IEEE-exact algebraic simplification, regardless of the module-level `fast_math` setting. Folds that are IEEE-exact for every input (e.g. `a - 0 -> a`, `a > a -> false`) are still applied. It is equivalent to the `precise` keyword in MSL / HLSL. | ||
|
|
||
| ## Why | ||
|
|
||
| Quadrants compiles kernels with `fast_math=True` by default. Under that mode the compiler is free to: | ||
|
|
||
| - **reassociate** FP ops (e.g. `(a + b) + c -> a + (b + c)`) | ||
| - **contract** mul-then-add into FMA | ||
| - **substitute approximations** for `sqrt`, `sin`, `cos`, `log`, `1/x` | ||
| - **algebraically simplify** (e.g. `a - a -> 0`, `a / a -> 1`) | ||
|
|
||
| This silently destroys compensated-arithmetic primitives (Dekker / Kahan 2Sum, Veltkamp split, double-single accumulators), whose correctness rests on the fact that an error term such as `(a - aa) + (b - bb)` is generally non-zero under IEEE arithmetic even though it is zero over the reals. The traditional workaround is to set `fast_math=False` globally, but that pays the performance cost everywhere, even when only a handful of lines need IEEE semantics. | ||
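The effect of these rewrites can be reproduced without Quadrants. The following is a minimal sketch in plain Python, whose `float` is an IEEE f64 evaluated without fast-math; the variable names are illustrative only. It shows that reassociating a sum changes the rounded result and that `a - a -> 0` is not valid for every input.

```python
import math

# Reassociation: the two groupings are equal over the reals but round differently.
a, b, c = 1e16, -1e16, 1.0
source_order = (a + b) + c    # a + b is exactly 0.0, then 0.0 + 1.0
reassociated = a + (b + c)    # b + c rounds back to -1e16, so the 1.0 is lost

# Algebraic simplification: a - a is not 0 when a is infinite (or NaN).
inf = float("inf")
inf_minus_inf = inf - inf

print(source_order, reassociated, inf_minus_inf)
```

A compensated algorithm relies on exactly the small terms that the second grouping discards, which is why it has to be evaluated in source order.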
|
|
||
| `qd.precise(expr)` is the per-expression opt-in: keep `fast_math=True` globally for speed, and wrap the expressions that must be IEEE-exact. | ||
|
|
||
| ## Basic usage | ||
|
|
||
| ```python | ||
| @qd.func | ||
| def fast_two_sum(a, b): | ||
|     # Fast2Sum: the error term e is exact only when |a| >= |b|. | ||
|     s = qd.precise(a + b) | ||
|     e = qd.precise(b - (s - a))  # would fold to 0 under fast-math without precise | ||
|     return s, e | ||
| ``` | ||
|
|
||
| Any scalar expression can be wrapped (see Caveats for `Vector` / `Matrix`). The wrapper returns a fresh copy of the expression with every reachable FP op tagged as precise; at codegen time the tagged ops opt out of the optimizations above. | ||
|
|
||
| ## What gets protected | ||
|
|
||
|
|
||
| `qd.precise` walks the wrapped expression tree and tags: | ||
|
|
||
| - Every `BinaryOp` (`+`, `-`, `*`, `/`, `%`, FP comparisons) | ||
| - Every `UnaryOp` (`neg`, `sqrt`, `sin`, `cos`, `log`, `exp`, `rsqrt`, casts, bit_cast, ...) | ||
|
|
||
| Bitwise operations (`bit_and`, `bit_or`, `bit_xor`, `bit_shl`, `bit_sar`) are integer-domain; the walker tags them for completeness but the flag has no effect on integer IR. | ||
|
|
||
| The walker descends through `BinaryOp`, `UnaryOp`, and `TernaryOp` (e.g. `qd.select`) nodes, so wrapping a composite expression protects the inner ops too: | ||
|
|
||
| ```python | ||
| # All four FP ops below are tagged: the outer sqrt, the inner add, and the two inner muls. | ||
| r = qd.precise(qd.sqrt(a * a + b * b)) | ||
|
|
||
| # Ternary is traversed through; the two branches and the condition's inner ops are tagged. | ||
| r = qd.precise(qd.select(cond, a + b, a - b)) | ||
| ``` | ||
|
|
||
| ## Where the walker stops | ||
|
|
||
| `qd.precise` does not descend into: | ||
|
|
||
| - Loads (ndarray indexing, field access) | ||
| - Constants | ||
| - `qd.func` call sites | ||
| - Atomic ops | ||
| - Intermediate Python variable assignments (`tmp = a + b` wraps the RHS in an internal alloca, so `qd.precise(tmp)` sees the alloca, not the inner `BinaryOp`, and is a silent no-op) | ||
|
|
||
| Semantics inside a `qd.func` body are governed by that body's own ops. If you want IEEE-strict behavior inside a called function, wrap the relevant ops inside the function's body, not at the call site. Similarly, wrap `qd.precise` directly around the expression rather than around a variable that was assigned earlier: | ||
|
|
||
| ```python | ||
| @qd.func | ||
| def dot_precise(a, b, c, d): | ||
|     # Wrap inside the body, not at the caller. | ||
|     return qd.precise(a * b + c * d) | ||
|
|
||
| @qd.kernel | ||
| def k(...): | ||
|
|
||
|     r = dot_precise(x, y, z, w)  # inner ops are already precise | ||
| ``` | ||
|
|
||
| ## Interaction with fast_math | ||
|
|
||
| `qd.precise` is a per-op override. It takes effect whether `fast_math` is on or off: | ||
|
|
||
| | Setting | Non-precise op | `qd.precise` op | | ||
| |---|---|---| | ||
| | `fast_math=True` | reassoc / contract / simplify | IEEE-strict | | ||
| | `fast_math=False` | mostly IEEE-strict (*) | IEEE-strict | | ||
|
|
||
| (*) Under `fast_math=False` most rewrites are already globally disabled, but the `a + 0 -> a` fold for FP adds is gated on `qd.precise` only (not on `fast_math`). Without the tag, `(-0.0) + 0.0` therefore still folds to `-0.0`, whereas IEEE 754 round-to-nearest evaluates it to `+0.0`. `qd.precise` is thus not fully redundant under `fast_math=False` for code that depends on signed-zero semantics. | ||
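The signed-zero case in the footnote can be checked in plain Python, which evaluates `float` arithmetic under IEEE 754 round-to-nearest. This is only an illustration of why the fold is not IEEE-exact; `folded` below simulates the rewrite by hand and is not produced by any compiler.

```python
import math

neg_zero = -0.0

ieee_sum = neg_zero + 0.0   # IEEE 754 round-to-nearest: (-0) + (+0) is +0
folded = neg_zero           # what an `a + 0 -> a` rewrite would return instead

# The two zeros compare equal, so the difference only shows through the sign bit.
sign_ieee = math.copysign(1.0, ieee_sum)
sign_folded = math.copysign(1.0, folded)

# Sign-sensitive functions then diverge: atan2(0, +0) is 0, atan2(0, -0) is pi.
angle_ieee = math.atan2(0.0, ieee_sum)
angle_folded = math.atan2(0.0, folded)

print(sign_ieee, sign_folded, angle_ieee, angle_folded)
```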
|
|
||
| The recommended workflow is to leave `fast_math=True` globally for throughput and reach for `qd.precise` only in the handful of spots that need IEEE behavior. | ||
|
|
||
| ## Backend coverage | ||
|
|
||
|
|
||
| | Backend | Reassoc / contraction / algebraic folds | Approximate transcendentals (`sin` / `cos` / `log`) | | ||
| |---|---|---| | ||
| | CPU | LLVM FMF cleared | libc `sinf` is already correctly rounded | | ||
| | CUDA | LLVM FMF cleared | libdevice `__nv_<fn>f` (non-fast) selected | | ||
| | AMDGPU | LLVM FMF cleared | `__ocml_<fn>` already correctly rounded | | ||
| | Vulkan / MoltenVK | SPIR-V `NoContraction` decoration | best-effort: driver stdlib default (spec only guarantees 2^-11 absolute error) | | ||
| | Metal | SPIR-V `NoContraction` decoration | best-effort: driver stdlib default (spec only guarantees 2^-11 absolute error) | | ||
|
|
||
| On SPIR-V backends, `NoContraction` is defined by the spec to apply to arithmetic instructions only; most consumers ignore it on the `OpExtInst` calls used for transcendentals. The decoration is still emitted (it is harmless and future-proofs against downstream toolchains that start honoring it), but the tag cannot guarantee the accuracy of `qd.precise(qd.sin(x))` / `qd.precise(qd.cos(x))` on Metal / Vulkan. The Vulkan precision requirement for GLSL.std.450 `Sin` / `Cos` is stated as 2^-11 absolute error, which for f32 results whose reference magnitude is below 1 amounts to thousands of ULPs or more, and drivers are within their rights to use all of that latitude. If you need correctly-rounded sin/cos, use the CPU / CUDA / AMDGPU backends. | ||
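To put the 2^-11 figure in perspective, the sketch below converts that absolute error bound into f32 ULPs at a given reference magnitude. It takes the 2^-11 bound from the paragraph above as given; the helper names `f32_ulp` and `allowed_error_in_ulps` are made up for this illustration, and the ULP formula assumes the reference value is a normal f32.

```python
import math

SPEC_ABS_ERROR = 2.0 ** -11  # absolute error bound quoted above for Sin / Cos


def f32_ulp(x: float) -> float:
    """Spacing between adjacent f32 values at magnitude |x| (normal range only)."""
    _, exponent = math.frexp(abs(x))      # abs(x) = m * 2**exponent, 0.5 <= m < 1
    return math.ldexp(1.0, exponent - 24)  # f32 has a 24-bit significand


def allowed_error_in_ulps(reference: float) -> float:
    """How many f32 ULPs of error the absolute bound permits at this magnitude."""
    return SPEC_ABS_ERROR / f32_ulp(reference)


for reference in (0.75, 0.001):
    print(reference, allowed_error_in_ulps(reference))
```

The smaller the reference value, the smaller its ULP, so a fixed absolute bound permits a larger error in ULP terms for small results such as `sin(x)` near zero.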
|
|
||
| ## Example: Knuth 2Sum | ||
|
|
||
| A textbook compensated addition that computes `s + e = a + b` exactly in f32: | ||
|
|
||
| ```python | ||
| @qd.func | ||
| def two_sum(a, b): | ||
|     s = qd.precise(a + b) | ||
|     bb = qd.precise(s - a) | ||
|     aa = qd.precise(s - bb) | ||
|     e = qd.precise((a - aa) + (b - bb)) | ||
|     return s, e | ||
| ``` | ||
|
|
||
| Without the `qd.precise` wrappers, under `fast_math=True` the compiler recognizes `(a - (s - (s - a))) + (b - (s - a))` as algebraically zero and folds `e` to `0`. The wrappers prevent that fold, and `s + e` reproduces `a + b` to full precision. | ||
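The exactness claim can be checked outside Quadrants. The sketch below runs the same six operations in plain Python, assuming only that CPython evaluates each `float` operation as an individually rounded IEEE f64 op (the behavior `qd.precise` requests, here in f64 rather than f32). `Fraction` is used to compare against the exact rational sum.

```python
from fractions import Fraction


def two_sum(a: float, b: float):
    """Same op sequence as the kernel-side two_sum, without the qd.precise wrappers."""
    s = a + b
    bb = s - a
    aa = s - bb
    e = (a - aa) + (b - bb)
    return s, e


a, b = 1.0, 1e-16          # b is below half an ULP of a, so a + b rounds to a
s, e = two_sum(a, b)

exact = Fraction(a) + Fraction(b)           # exact rational sum of the two inputs
recovered = Fraction(s) + Fraction(e)       # s + e, also evaluated exactly

print(s, e, recovered == exact)
```

Here `e` carries the part of `b` that the rounded sum `s` dropped; a compiler that folds `e` to `0` would silently lose it.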
|
|
||
| ## Caveats | ||
|
|
||
| - `qd.precise` is a scalar primitive. Passing a `Vector` / `Matrix` will raise. Apply it to individual components instead, or refactor your expression to use scalar ops inside. | ||
| - `qd.precise` does not mutate its input. It returns a fresh expression subtree with every reachable FP op tagged; the original expression is unchanged. Reusing the original elsewhere is safe and never inherits the tag. | ||
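The clone-and-tag behavior described in the last caveat can be pictured with a toy model. The sketch below is not the Quadrants implementation: `Load`, `Op`, `clone_and_tag`, and `count_precise` are hypothetical names, the tree is a simplified stand-in for the real expression IR, and the traversal is recursive for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Load:
    """Leaf node (a load or constant): the walker stops here."""
    name: str


@dataclass(frozen=True)
class Op:
    """Unary / binary / ternary op node with a per-op precise flag."""
    kind: str
    args: tuple
    precise: bool = False


def clone_and_tag(node):
    """Return a copy of the subtree with every Op tagged; leaves are shared as-is."""
    if isinstance(node, Op):
        return Op(node.kind, tuple(clone_and_tag(arg) for arg in node.args), precise=True)
    return node


def count_precise(node) -> int:
    """Count the tagged ops reachable from node."""
    if not isinstance(node, Op):
        return 0
    return int(node.precise) + sum(count_precise(arg) for arg in node.args)


a, b = Load("a"), Load("b")
# Toy equivalent of sqrt(a * a + b * b): one sqrt, one add, two muls.
expr = Op("sqrt", (Op("add", (Op("mul", (a, a)), Op("mul", (b, b)))),))
tagged = clone_and_tag(expr)

print(count_precise(tagged), count_precise(expr))
```

Because the tagged tree is a separate copy, reusing `expr` elsewhere does not carry the flag along, which mirrors the guarantee stated in the caveat.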