Changes from all commits (41 commits)
ba88c69
[Lang] Add qd.precise(...) for per-op IEEE-strict FP
duburcqa Apr 13, 2026
d14f322
[Lang] qd.precise: cover UnaryOpStmt as well
duburcqa Apr 13, 2026
1898a31
[Lang] qd.precise: address self-review feedback
duburcqa Apr 13, 2026
450fb93
[Lang] qd.precise: gate alg_simp folds, cover sqrt, DRY CUDA libdevice
duburcqa Apr 13, 2026
fdeb1ea
[Lang] qd.precise: scrub non-ASCII from comments
duburcqa Apr 13, 2026
6180f04
[Lang] qd.precise: replace -- with single - in comments
duburcqa Apr 13, 2026
9bb5342
[Doc] User guide entry for qd.precise
duburcqa Apr 13, 2026
8abb2b3
[Lang] qd.precise: factor disable_fast_math helper, add Vector/select…
duburcqa Apr 13, 2026
cc68a95
[Lang] qd.precise: propagate tag in 2*a rewrite, narrow zero-fold gat…
duburcqa Apr 13, 2026
c4a8dac
[Lang] qd.precise: use make_typed to avoid downcast on synthesized 2*…
duburcqa Apr 13, 2026
29fb886
Cleanup doc.
duburcqa Apr 13, 2026
3841fea
[Lang] qd.precise: cover walker boundaries (qd.func, bit_cast, alias,…
duburcqa Apr 13, 2026
21f20a9
[Lang] qd.precise: fix docstring to mention unary FP ops and approxim…
duburcqa Apr 13, 2026
b8ec4f8
[Lang] qd.precise: unify precise field comments via canonical referen…
duburcqa Apr 13, 2026
8601d6f
[Lang] qd.precise: propagate tag through synthesized stmts in alg_sim…
duburcqa Apr 13, 2026
6f30d28
[Lang] qd.precise: clear LLVM FMF on intermediate and pre-FPTrunc values
duburcqa Apr 13, 2026
3af2f9f
[Lang] qd.precise: SPIR-V inv forwards precise, inline maybe_no_contr…
duburcqa Apr 13, 2026
d4ffbe8
[Lang] qd.precise: drop bit-ops-on-FP from doc; align __all__ positio…
duburcqa Apr 13, 2026
bc3c358
[Lang] qd.precise: clone input subtree instead of mutating in-place; …
duburcqa Apr 13, 2026
8a58940
[Lang] qd.precise: parametrize unary rounding test per op for per-op …
duburcqa Apr 13, 2026
cf9023a
[Lang] qd.precise: SPIR-V visit(BinaryOpStmt) tags FP transcendental …
duburcqa Apr 13, 2026
4259432
[Lang] qd.precise: reflow PR-introduced C++ comments to 120 cols
duburcqa Apr 13, 2026
0c47065
[Lang] qd.precise: propagate tag through cast in 2*a rewrite (and ref…
duburcqa Apr 13, 2026
41801a7
[Lang] qd.precise: CUDA emit_extra_unary clears FMF on libdevice call…
duburcqa Apr 13, 2026
e01778b
[Lang] qd.precise: skip sin/cos unary-rounding on SPIR-V, drop redund…
duburcqa Apr 13, 2026
5a2dbb9
[Lang] qd.precise: unary-rounding test restricts to LLVM via arch dec…
duburcqa Apr 13, 2026
e3196b7
[Lang] qd.precise: type_check propagates tag through implicit operand…
duburcqa Apr 13, 2026
8e52ee1
[Lang] qd.precise: document SPIR-V arithmetic/post-hoc two-layer deco…
duburcqa Apr 13, 2026
4aa6c7f
[Lang] qd.precise: scalarize propagates tag onto per-element scalar B…
duburcqa Apr 13, 2026
14fb6ca
[Lang] qd.precise: SPIR-V decorates FP ops once via post-hoc block; d…
duburcqa Apr 13, 2026
7f34d62
[Lang] qd.precise: idempotency test also covers AMDGPU (also an LLVM …
duburcqa Apr 13, 2026
5676eb8
[Lang] qd.precise: AMDGPU i32 pow clears FMF on __ocml_pow_f64 call b…
duburcqa Apr 13, 2026
43c4367
[Lang] qd.precise: exclude cmp_gt/cmp_lt from precise guard (IEEE-fal…
duburcqa Apr 13, 2026
85fbb6c
[Lang] qd.precise: iterative worklist in clone_and_tag_precise (O(1) …
duburcqa Apr 13, 2026
94fbfc5
[Lang] qd.precise: precise_fp_add requires FP operand type; integer a…
duburcqa Apr 13, 2026
b519f33
[Lang] qd.precise: fix same_operation comment, document IdExpression …
duburcqa Apr 14, 2026
0eb62de
[Lang] qd.precise: IR printer annotates [precise] on Unary/BinaryOpSt…
duburcqa Apr 14, 2026
acdcfbd
[Lang] qd.precise: fix op count in precise.md example comment (three …
duburcqa Apr 14, 2026
426198e
[Lang] qd.precise: add rsqrt to unary-rounding test; add floordiv con…
duburcqa Apr 14, 2026
cafb630
[Lang] qd.precise: fix fast_math=False table row; a+0 fold is precise…
duburcqa Apr 14, 2026
6712b0c
Merge branch 'experimental' into duburcqa/qd_precise
hughperkins Apr 16, 2026
1 change: 1 addition & 0 deletions docs/source/user_guide/index.md
@@ -19,6 +19,7 @@ scalar_tensors
matrix_vector
compound_types
static
precise
sub_functions
parallelization
```
116 changes: 116 additions & 0 deletions docs/source/user_guide/precise.md
@@ -0,0 +1,116 @@
# qd.precise

`qd.precise(expr)` marks a floating-point expression as IEEE-strict. Every binary and unary FP op inside the wrapped subtree is evaluated in source order with no reassociation, no FMA contraction, and no non-IEEE-exact algebraic simplification, regardless of the module-level `fast_math` setting. Folds that are IEEE-exact for every input (e.g. `a - 0 -> a`, `a > a -> false`) are still applied. It is equivalent to the `precise` keyword in MSL / HLSL.

## Why

Quadrants compiles kernels with `fast_math=True` by default. Under that mode the compiler is free to:

- **reassociate** FP ops (e.g. `(a + b) + c -> a + (b + c)`)
- **contract** mul-then-add into FMA
- **substitute approximations** for `sqrt`, `sin`, `cos`, `log`, `1/x`
- **algebraically simplify** (e.g. `a - a -> 0`, `a / a -> 1`)

This silently destroys compensated-arithmetic primitives (Dekker / Kahan 2Sum, Veltkamp split, double-single accumulators) whose correctness rests on the fact that under IEEE arithmetic `(a - aa) + (b - bb)` evaluates to the exact rounding error rather than folding to zero. The traditional workaround is to flip the global `fast_math=False` switch, but that pays the perf cost everywhere, even when only a handful of lines need IEEE semantics.
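
The quantity at stake is easy to see on the host, where NumPy's `float32` arithmetic follows IEEE semantics (a plain-Python sketch, not Quadrants code):

```python
import numpy as np

a, b = np.float32(1.0), np.float32(1e-8)
s = a + b        # rounds to 1.0 in f32: b is below half an ulp of a
e = b - (s - a)  # algebraically zero, but IEEE arithmetic recovers ~1e-8
print(s, e)      # 1.0 1e-08
```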

`qd.precise(expr)` is the per-expression opt-in: keep `fast_math=True` globally for speed, and wrap the expressions that must be IEEE-exact.

## Basic usage

```python
@qd.func
def fast_two_sum(a, b):
    s = qd.precise(a + b)
    e = qd.precise(b - (s - a))  # would fold to 0 under fast-math without precise
    return s, e
```

Any expression value can be wrapped. The wrapper returns a copy of the expression with every reachable FP op tagged as precise; at codegen time the tagged ops opt out of the optimizations above.

## What gets protected

`qd.precise` walks the wrapped expression tree and tags:

- Every `BinaryOp` (`+`, `-`, `*`, `/`, `%`, FP comparisons)
- Every `UnaryOp` (`neg`, `sqrt`, `sin`, `cos`, `log`, `exp`, `rsqrt`, casts, bit_cast, ...)

Bitwise operations (`bit_and`, `bit_or`, `bit_xor`, `bit_shl`, `bit_sar`) are integer-domain; the walker tags them for completeness but the flag has no effect on integer IR.

The walker descends through `BinaryOp`, `UnaryOp`, and `TernaryOp` (e.g. `qd.select`) nodes, so wrapping a composite expression protects the inner ops too:

```python
# All four FP ops below are tagged: the outer sqrt, the inner add, and the two inner muls.
r = qd.precise(qd.sqrt(a * a + b * b))

# Ternary is traversed through; the two branches and the condition's inner ops are tagged.
r = qd.precise(qd.select(cond, a + b, a - b))
```

## Where the walker stops

`qd.precise` does not descend into:

- Loads (ndarray indexing, field access)
- Constants
- `qd.func` call sites
- Atomic ops
- Intermediate Python variable assignments (`tmp = a + b` wraps the RHS in an internal alloca, so `qd.precise(tmp)` sees the alloca, not the inner `BinaryOp`, and is a silent no-op)

Semantics inside a `qd.func` body are governed by that body's own ops. If you want IEEE-strict behavior inside a called function, wrap the relevant ops inside the function's body, not at the call site. Similarly, wrap `qd.precise` directly around the expression rather than around a variable that was assigned earlier; both pitfalls are sketched below:

```python
@qd.func
def dot_precise(a, b, c, d):
    # Wrap inside the body, not at the caller.
    return qd.precise(a * b + c * d)

@qd.kernel
def k(...):
    r = dot_precise(x, y, z, w)  # inner ops are already precise
```
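
And the variable-assignment pitfall from the list above, as a minimal sketch:

```python
@qd.func
def leaky(a, b):
    tmp = a + b               # the add is now hidden behind an internal alloca
    bad = qd.precise(tmp)     # silent no-op: the walker sees the alloca, not the add
    good = qd.precise(a + b)  # correct: the BinaryOp sits inside the wrapper
    return bad, good
```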

## Interaction with fast_math

`qd.precise` is a per-op override. It takes effect whether `fast_math` is on or off:

| Setting | Non-precise op | `qd.precise` op |
|---|---|---|
| `fast_math=True` | reassoc / contract / simplify | IEEE-strict |
| `fast_math=False` | mostly IEEE-strict (*) | IEEE-strict |

(*) Under `fast_math=False` most rewrites are already globally disabled, but the `a + 0 -> a` fold for FP adds is gated on `qd.precise` only (not on `fast_math`), so `(-0.0) + 0.0` still folds to `-0.0` without the tag. `qd.precise` is therefore not fully redundant under `fast_math=False` for code that depends on signed-zero semantics.
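
A sketch of that corner case (the ndarray-style kernel-argument annotation here is illustrative, not prescriptive):

```python
@qd.kernel
def signed_zero(out: qd.types.ndarray()):
    a = -0.0
    out[0] = a + 0.0              # may fold to a, i.e. -0.0, even with fast_math=False
    out[1] = qd.precise(a + 0.0)  # IEEE: (-0.0) + (+0.0) == +0.0
```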

The recommended workflow is to leave `fast_math=True` globally for throughput and reach for `qd.precise` only in the handful of spots that need IEEE behavior.

## Backend coverage

| Backend | Reassoc / contraction / algebraic folds | Approximate transcendentals (`sin` / `cos` / `log`) |
|---|---|---|
| CPU | LLVM FMF cleared | libc `sinf` is already correctly rounded |
| CUDA | LLVM FMF cleared | libdevice `__nv_<fn>f` (non-fast) selected |
| AMDGPU | LLVM FMF cleared | `__ocml_<fn>` already correctly rounded |
| Vulkan / MoltenVK | SPIR-V `NoContraction` decoration | best-effort: driver stdlib default (spec only guarantees 2^-11 absolute error) |
| Metal | SPIR-V `NoContraction` decoration | best-effort: driver stdlib default (spec only guarantees 2^-11 absolute error) |

On SPIR-V backends, `NoContraction` is defined by the spec to apply to arithmetic instructions only; most consumers ignore it on the `OpExtInst` calls used for transcendentals. The decoration is still emitted (it is harmless and future-proofs against downstream toolchains that start honoring it), but correctness of `qd.precise(qd.sin(x))` / `qd.precise(qd.cos(x))` on Metal / Vulkan cannot be guaranteed through the tag: the Vulkan precision requirements for GLSL.std.450 `Sin`/`Cos` are stated as 2^-11 absolute error, which on inputs whose reference magnitude is smaller than 1 is thousands of ULPs, and drivers are within their rights to saturate that latitude. If you need correctly-rounded sin/cos, use the CPU / CUDA / AMDGPU backends.

## Example: Dekker 2Sum

A textbook compensated addition that computes `s + e = a + b` exactly in f32:

```python
@qd.func
def two_sum(a, b):
    s = qd.precise(a + b)
    bb = qd.precise(s - a)
    aa = qd.precise(s - bb)
    e = qd.precise((a - aa) + (b - bb))
    return s, e
```

Without the `qd.precise` wrappers, under `fast_math=True` the compiler recognizes `(a - (s - (s - a))) + (b - (s - a))` as algebraically zero and folds `e` to `0`. The wrappers prevent that fold, and `s + e` reproduces `a + b` to full precision.
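
A sketch of the pair the function hands back (ndarray-style kernel arguments assumed for illustration; comments assume f32 defaults):

```python
@qd.kernel
def demo(out: qd.types.ndarray()):
    a = 0.1
    b = 0.2
    s, e = two_sum(a, b)
    out[0] = s  # round(a + b): the ordinary f32 sum
    out[1] = e  # the exact residual (a + b) - s, generally nonzero
```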

## Caveats

- `qd.precise` is a scalar primitive. Passing a `Vector` / `Matrix` will raise. Apply it to individual components instead, or refactor your expression to use scalar ops inside (see the sketch below).
- `qd.precise` does not mutate its input. It returns a fresh expression subtree with every reachable FP op tagged; the original expression is unchanged. Reusing the original elsewhere is safe and never inherits the tag.
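
A per-component sketch of the first caveat (the `qd.Vector` construction and indexing shown here follow the usual component-access style and may need adjusting to your types):

```python
@qd.func
def two_sum_componentwise(u, v):
    # qd.precise(u + v) would raise: u + v is a Vector, not a scalar.
    s0 = qd.precise(u[0] + v[0])
    s1 = qd.precise(u[1] + v[1])
    return qd.Vector([s0, s1])
```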
54 changes: 54 additions & 0 deletions python/quadrants/lang/ops.py
@@ -95,6 +95,59 @@ def cast(obj, dtype):
    return expr.Expr(_qd_core.value_cast(expr.Expr(obj).ptr, dtype))


def precise(obj):
    """Mark a floating-point expression as IEEE-strict.

    Every binary and unary FP op inside ``obj`` is evaluated in source
    order with no reassociation, no FMA contraction, no approximate
    transcendental substitution, and no non-IEEE-exact algebraic
    simplification, regardless of the module-level :attr:`fast_math`
    setting. Folds that are IEEE-exact for every input (e.g.
    ``a - 0 -> a``, ``a > a -> false``) are still applied. This is
    equivalent to MSL's / HLSL's ``precise`` keyword and lets you keep
    ``fast_math=True`` globally while protecting compensated-arithmetic
    blocks (Dekker / Kahan 2Sum, Veltkamp split, etc.) from being folded
    away.

    Recursion descends through ``BinaryOp``, ``UnaryOp`` (cast, bit_cast,
    neg, sqrt, ...), and ``TernaryOp`` (select) wrappers so that inner
    binary ops are reached even when wrapped, e.g.
    ``qd.precise(qd.bit_cast(a + b, qd.f32))``. It stops at loads,
    constants, ``qd.func`` calls, ndarray accesses, etc.; semantics inside
    a ``qd.func`` body are governed by that body's own ops - wrap calls
    separately if needed.

    Notes:
        * ``qd.precise`` does NOT mutate the input expression. It returns
          a fresh subtree that mirrors the input's structure, with every
          reachable Binary / Unary / Ternary node cloned and the new
          Binary / Unary nodes tagged as ``precise``. Non-walked nodes
          (loads, constants, ``qd.func`` calls, ndarray accesses, ...)
          are shared with the input by reference. The practical upshot:
          reusing the original (pre-``precise``) expression value
          elsewhere is safe - it will NOT pick up the tag.

    Args:
        obj: A scalar Quadrants expression (typically a chain of FP ops).

    Returns:
        A fresh expression subtree with every reachable binary and unary
        FP op tagged as ``precise``. The original ``obj`` is unchanged.

    Example::

        >>> @qd.func
        >>> def fast_two_sum(a, b):
        >>>     # Local IEEE region, survives even with fast_math=True.
        >>>     s = qd.precise(a + b)
        >>>     e = qd.precise(b - (s - a))
        >>>     return s, e
    """
    if is_quadrants_class(obj):
        raise ValueError("Cannot apply precise on Quadrants classes")
    return expr.Expr(_qd_core.precise(expr.Expr(obj).ptr))


def bit_cast(obj, dtype):
    """Copy and cast a scalar to a specified data type with its underlying
    bits preserved. Must be called in quadrants scope.
@@ -1535,4 +1588,5 @@ def min(*args):  # pylint: disable=W0622
"select",
"abs",
"pow",
"precise",
]
2 changes: 2 additions & 0 deletions quadrants/analysis/gen_offline_cache_key.cpp
@@ -88,6 +88,7 @@ class ASTSerializer : public IRVisitor, public ExpressionVisitor {
  void visit(UnaryOpExpression *expr) override {
    emit(ExprOpCode::UnaryOpExpression);
    emit(expr->type);
    emit(expr->precise);
    if (expr->is_cast()) {
      emit(expr->cast_type);
    }
@@ -97,6 +98,7 @@
  void visit(BinaryOpExpression *expr) override {
    emit(ExprOpCode::BinaryOpExpression);
    emit(expr->type);
    emit(expr->precise);
    emit(expr->lhs);
    emit(expr->rhs);
  }
15 changes: 15 additions & 0 deletions quadrants/codegen/amdgpu/codegen_amdgpu.cpp
@@ -389,6 +389,11 @@ class TaskCodeGenAMDGPU : public TaskCodeGenLLVM {
    if (op != BinaryOpType::atan2 && op != BinaryOpType::pow) {
      return TaskCodeGenLLVM::visit(stmt);
    }
    // The base-class `visit(BinaryOpStmt*)` terminates with `if (stmt->precise) disable_fast_math(...)` so LLVM cannot
    // substitute approximate variants for precise-tagged FP ops. The AMDGPU override below returns without chaining to
    // the base, so we mirror that same guard on the __ocml_* call results. AMDGPU's `__ocml_*` transcendentals are
    // currently correctly-rounded (no `__ocml_fast_*` variants), so this is defensive against future libocml changes
    // rather than a bug today.
    auto lhs = llvm_val[stmt->lhs];
    auto rhs = llvm_val[stmt->rhs];
@@ -403,6 +408,13 @@
        auto sitofp_lhs_ = builder->CreateSIToFP(lhs, llvm::Type::getDoubleTy(*llvm_context));
        auto sitofp_rhs_ = builder->CreateSIToFP(rhs, llvm::Type::getDoubleTy(*llvm_context));
        auto ret_ = call("__ocml_pow_f64", {sitofp_lhs_, sitofp_rhs_});
        // FPToSI is not an FPMathOperator, so the post-hoc `disable_fast_math(llvm_val[stmt])` below would be a no-op
        // on it and leave the `__ocml_pow_f64` CallInst still carrying the IRBuilder's `afn` / `reassoc` / ... Clear
        // FMF here on the actual call before its handle is overwritten by the FPToSI. Mirrors the f16 FPTrunc guards
        // in `codegen_llvm.cpp` and `codegen_cuda.cpp::emit_extra_unary`.
        if (stmt->precise) {
          disable_fast_math(ret_);
        }
        llvm_val[stmt] = builder->CreateFPToSI(ret_, llvm::Type::getInt32Ty(*llvm_context));
      } else {
        QD_NOT_IMPLEMENTED
@@ -418,6 +430,9 @@
        QD_NOT_IMPLEMENTED
      }
    }
    if (stmt->precise) {
      disable_fast_math(llvm_val[stmt]);
    }
  }

 private:
31 changes: 23 additions & 8 deletions quadrants/codegen/cuda/codegen_cuda.cpp
@@ -218,6 +218,9 @@ class TaskCodeGenCUDA : public TaskCodeGenLLVM {
    }

    auto op = stmt->op_type;
    // The fast-math libdevice variants (__nv_fast_*) bypass LLVM FMF entirely (they're plain function calls, not FP
    // intrinsics), so qd.precise(...) has to opt out of them at each call site below.
    const bool use_fast = compile_config.fast_math && !stmt->precise;

#define UNARY_STD(x) \
  else if (op == UnaryOpType::x) { \
@@ -288,8 +291,7 @@
      }
    } else if (op == UnaryOpType::log) {
      if (input_quadrants_type->is_primitive(PrimitiveTypeID::f32)) {
-        // logf has fast-math option
-        llvm_val[stmt] = call(compile_config.fast_math ? "__nv_fast_logf" : "__nv_logf", input);
+        llvm_val[stmt] = call(use_fast ? "__nv_fast_logf" : "__nv_logf", input);
      } else if (input_quadrants_type->is_primitive(PrimitiveTypeID::f64)) {
        llvm_val[stmt] = call("__nv_log", input);
      } else if (input_quadrants_type->is_primitive(PrimitiveTypeID::i32)) {
@@ -299,8 +301,7 @@
      }
    } else if (op == UnaryOpType::sin) {
      if (input_quadrants_type->is_primitive(PrimitiveTypeID::f32)) {
-        // sinf has fast-math option
-        llvm_val[stmt] = call(compile_config.fast_math ? "__nv_fast_sinf" : "__nv_sinf", input);
+        llvm_val[stmt] = call(use_fast ? "__nv_fast_sinf" : "__nv_sinf", input);
      } else if (input_quadrants_type->is_primitive(PrimitiveTypeID::f64)) {
        llvm_val[stmt] = call("__nv_sin", input);
      } else if (input_quadrants_type->is_primitive(PrimitiveTypeID::i32)) {
@@ -310,8 +311,7 @@
      }
    } else if (op == UnaryOpType::cos) {
      if (input_quadrants_type->is_primitive(PrimitiveTypeID::f32)) {
-        // cosf has fast-math option
-        llvm_val[stmt] = call(compile_config.fast_math ? "__nv_fast_cosf" : "__nv_cosf", input);
+        llvm_val[stmt] = call(use_fast ? "__nv_fast_cosf" : "__nv_cosf", input);
      } else if (input_quadrants_type->is_primitive(PrimitiveTypeID::f64)) {
        llvm_val[stmt] = call("__nv_cos", input);
      } else if (input_quadrants_type->is_primitive(PrimitiveTypeID::i32)) {
@@ -332,7 +332,14 @@
    }
#undef UNARY_STD
    if (stmt->ret_type->is_primitive(PrimitiveTypeID::f16)) {
-      // Convert back to f16.
+      // Convert back to f16. FPTrunc is not an FPMathOperator, so the post-hoc
+      // `disable_fast_math(llvm_val[stmt])` in visit(UnaryOpStmt*) would be a no-op on it and leave
+      // the libdevice CallInst (an FPMathOperator when returning FP) still carrying the IRBuilder's
+      // `afn` / `reassoc` / ... Clear FMF here on the actual call before its handle is overwritten
+      // by the FPTrunc. Mirrors the guard in the base class emit_extra_unary().
+      if (stmt->precise) {
+        disable_fast_math(llvm_val[stmt]);
+      }
      llvm_val[stmt] = builder->CreateFPTrunc(llvm_val[stmt], llvm::Type::getHalfTy(*llvm_context));
    }
  }
@@ -703,10 +710,18 @@
      }
    }

-    // Convert back to f16 if applicable.
+    // Convert back to f16 if applicable. Mirror the base class's pattern: clear FMF on the actual FP call before the
+    // FPTrunc overwrites its handle (FPTrunc is not an FPMathOperator). The AMDGPU override does the same; this branch
+    // of the CUDA override previously skipped the clear entirely because the base class never runs for pow/atan2.
    if (stmt->ret_type->is_primitive(PrimitiveTypeID::f16)) {
+      if (stmt->precise) {
+        disable_fast_math(llvm_val[stmt]);
+      }
      llvm_val[stmt] = builder->CreateFPTrunc(llvm_val[stmt], llvm::Type::getHalfTy(*llvm_context));
    }
+    if (stmt->precise) {
+      disable_fast_math(llvm_val[stmt]);
+    }
  }

  void visit(InternalFuncStmt *stmt) override {