A high-performance hardware implementation of special mathematical functions for single-precision floating-point numbers (FP32), designed using Chisel HDL.
This project implements a unified Special Function Unit (SFU) that supports seven FP32 functions using a shared pipelined architecture based on fixed-point quadratic interpolation with lookup tables (LUT):
-
EXP2: Base-2 exponential
$2^x$ -
LOG2: Base-2 logarithm
$\log_2(x)$ -
RCP: Reciprocal
$1/x$ -
SQRT: Square root
$\sqrt{x}$ -
RSQRT: Reciprocal square root
$1/\sqrt{x}$ -
SIN:
$\sin!\left(\frac{\pi}{2} x\right)$ -
COS:
$\cos!\left(\frac{\pi}{2} x\right)$
The architecture is based on the multifunction interpolation approach described by Oberman and Siu [1], which is consistent with the design used in NVIDIA SFUs.
Each function is approximated by a quadratic polynomial over a set of small sub-intervals. Given an input
The three coefficients
For EXP2, the argument is decomposed as
For SIN/COS, the functions computed are quadrant = floor(x) mod 4 is read from the two low bits of the integer part of
The final result is assembled by combining the polynomial output with the input exponent according to each function's composition rule.
S0: Filter (1 cycle)
- Handle special values: NaN, ±Inf, ±0, subnormals (treated as zero)
- Handle out-of-range inputs for EXP2 (overflow/underflow)
- Handle negative inputs for LOG2, SQRT, RSQRT
- Output bypass flag and bypass value for special cases
- Pass normal inputs to next stage
S1: RangeReduce (1 cycle)
- EXP2/SIN/COS: Compute integer and fractional decomposition
- SIN/COS: quadrant = floor(x) mod 4; map fractional part f to t = f or 1-f based on quadrant[0]; sign from quadrant[1]
- LOG2/RCP/SQRT/RSQRT: Extract exponent and split mantissa into index + xl
- SQRT/RSQRT: index[6] = E mod 2 (selects even/odd LUT)
- Output: index (7 bits), xl (17 bits), exp (8 bits signed), sign (1 bit)
S2: LUT (1 cycle)
- Index: 7 bits → 128 entries per function
- EXP2/LOG2/SIN/COS: 64-entry tables (index[5:0])
- RCP: 128-entry table (full 7-bit index)
- SQRT/RSQRT: two 64-entry tables (even/odd), selected by index[6]
- Each entry stores (c0, c1, c2) as signed fixed-point integers
S3: Poly Stage 1 — Squarer (1 cycle)
- Compute xl^2 (17×17 unsigned multiply)
- Truncate and shift to 15-bit aligned representation
S4: Poly Stage 2 — Parallel Multiply (1 cycle)
- Compute c1 × xl (17×17 signed multiply → 35 bits)
- Compute c2 × xl² (13×15 signed multiply → 29 bits)
S5: Poly Stage 3 — Align and Sum (1 cycle)
- Align c1·xl and c2·xl² to the scale of c0
- Compute polyResult = c0 + aligned(c1·xl) + aligned(c2·xl²)
- Output: 27-bit fixed-point result
S6: Compose (1 cycle)
- Assemble final FP32 from polyResult and exp:
· EXP2: exp_out = 127 + exp, mant = polyResult[24:2]
· LOG2: leading-zero detection on {exp, polyResult[25:0]}, normalize
· RCP: exp_out = 126 - exp, mant = polyResult[24:2]
· SQRT: exp_out = 127 + exp, mant = polyResult[24:2]
· RSQRT: exp_out = 126 - exp, mant = polyResult[24:2]
· SIN/COS: leading-zero normalization; clamp to ±1 if polyResult overflows
- If bypass flag set, output bypassVal; otherwise output assembled result
- Unified Architecture: Single 7-stage pipeline serves all seven functions
- Full Throughput: One result per cycle after initial latency
- 128-Entry LUTs: Per-function coefficient tables (some split into even/odd for SQRT/RSQRT)
- Fixed-Point Polynomial: Quadratic approximation in integer arithmetic — no FP multipliers in datapath
- Offline Coefficient Optimization: Coefficients minimized via minimax optimization (optimizer tool)
- Special Value Handling: Comprehensive coverage of IEEE 754 edge cases
Coefficients are optimized offline using the optimizer tool to minimize the worst-case absolute error within each sub-interval. The following tables compare accuracy against NVIDIA GPU results (measured with -use_fast_math) across
| Interval | Implementation | MaxAbsErr | MaxULP | AvgAbsErr | AvgULP |
|---|---|---|---|---|---|
| [0.25, 0.5) | This work | 2.384e-07 | 2 | 4.314e-08 | 0.36 |
| NVIDIA GPU | 1.192e-07 | 1 | 3.020e-08 | 0.25 | |
| [0.5, 1) | This work | 2.384e-07 | 2 | 4.269e-08 | 0.36 |
| NVIDIA GPU | 1.192e-07 | 1 | 3.746e-08 | 0.31 | |
| [1, 2) | This work | 2.384e-07 | 1 | 2.544e-08 | 0.11 |
| NVIDIA GPU | 2.384e-07 | 1 | 9.273e-08 | 0.39 | |
| [2, 4) | This work | 9.537e-07 | 1 | 7.624e-08 | 0.11 |
| NVIDIA GPU | 9.537e-07 | 1 | 2.801e-07 | 0.39 |
| Interval | Implementation | MaxAbsErr | MaxULP | AvgAbsErr | AvgULP |
|---|---|---|---|---|---|
| [0.25, 0.5) | This work | 2.384e-07 | 2 | 6.518e-08 | 0.55 |
| NVIDIA GPU | 2.384e-07 | 2 | 1.201e-07 | 1.01 | |
| [0.5, 1) | This work | 1.192e-07 | 1.6M | 2.775e-08 | 3.43 |
| NVIDIA GPU | 2.384e-07 | 24M | 7.843e-08 | 20.06 | |
| [1, 2) | This work | 8.941e-08 | 866M | 1.802e-08 | 109.79 |
| NVIDIA GPU | 1.602e-07 | 6.1M | 4.342e-08 | 11.20 | |
| [2, 4) | This work | 2.384e-07 | 2 | 3.972e-08 | 0.33 |
| NVIDIA GPU | 1.192e-07 | 1 | 2.954e-08 | 0.25 |
Note on LOG2 ULP near 1.0: Very large ULP values in
$[0.5, 2)$ are expected. Near$x = 1$ ,$\log_2(x) \approx 0$ , so any small absolute error maps to an enormous ULP count. The absolute errors remain below$2^{-22}$ throughout.
| Interval | Implementation | MaxAbsErr | MaxULP | AvgAbsErr | AvgULP |
|---|---|---|---|---|---|
| [0.25, 0.5) | This work | 2.384e-07 | 1 | 2.183e-08 | 0.09 |
| NVIDIA GPU | 2.384e-07 | 1 | 3.151e-08 | 0.13 | |
| [0.5, 1) | This work | 1.192e-07 | 1 | 1.092e-08 | 0.09 |
| NVIDIA GPU | 1.192e-07 | 1 | 1.575e-08 | 0.13 | |
| [1, 2) | This work | 5.960e-08 | 1 | 5.459e-09 | 0.09 |
| NVIDIA GPU | 5.960e-08 | 1 | 7.877e-09 | 0.13 | |
| [2, 4) | This work | 2.980e-08 | 1 | 2.729e-09 | 0.09 |
| NVIDIA GPU | 2.980e-08 | 1 | 3.938e-09 | 0.13 |
| Interval | Implementation | MaxAbsErr | MaxULP | AvgAbsErr | AvgULP |
|---|---|---|---|---|---|
| [0.25, 0.5) | This work | 5.960e-08 | 1 | 4.992e-09 | 0.08 |
| NVIDIA GPU | 5.960e-08 | 1 | 9.946e-09 | 0.17 | |
| [0.5, 1) | This work | 5.960e-08 | 1 | 4.900e-09 | 0.08 |
| NVIDIA GPU | 5.960e-08 | 1 | 1.016e-08 | 0.17 | |
| [1, 2) | This work | 1.192e-07 | 1 | 9.985e-09 | 0.08 |
| NVIDIA GPU | 1.192e-07 | 1 | 1.989e-08 | 0.17 | |
| [2, 4) | This work | 1.192e-07 | 1 | 9.801e-09 | 0.08 |
| NVIDIA GPU | 1.192e-07 | 1 | 2.032e-08 | 0.17 |
| Interval | Implementation | MaxAbsErr | MaxULP | AvgAbsErr | AvgULP |
|---|---|---|---|---|---|
| [0.25, 0.5) | This work | 1.192e-07 | 1 | 1.692e-08 | 0.14 |
| NVIDIA GPU | 2.384e-07 | 2 | 2.677e-08 | 0.22 | |
| [0.5, 1) | This work | 1.192e-07 | 1 | 1.388e-08 | 0.12 |
| NVIDIA GPU | 1.192e-07 | 1 | 2.488e-08 | 0.21 | |
| [1, 2) | This work | 5.960e-08 | 1 | 8.461e-09 | 0.14 |
| NVIDIA GPU | 1.192e-07 | 2 | 1.338e-08 | 0.22 | |
| [2, 4) | This work | 5.960e-08 | 1 | 6.942e-09 | 0.12 |
| NVIDIA GPU | 5.960e-08 | 1 | 1.244e-08 | 0.21 |
| Interval | Implementation | MaxAbsErr | MaxULP | AvgAbsErr | AvgULP |
|---|---|---|---|---|---|
| [0.25, 0.5) | This work | 2.980e-07 | 12 | 1.068e-07 | 3.61 |
| NVIDIA GPU | 3.278e-07 | 11 | 1.195e-07 | 4.03 | |
| [0.5, 1) | This work | 2.980e-07 | 10 | 1.019e-07 | 1.80 |
| NVIDIA GPU | 3.576e-07 | 10 | 1.140e-07 | 2.00 | |
| [1, 2) | This work | 2.384e-07 | 4 | 5.998e-08 | 1.01 |
| NVIDIA GPU | 2.384e-07 | 4 | 4.592e-08 | 0.77 | |
| [2, 4) | This work | 4.470e-07 | 6.4M | 1.214e-07 | 18.70 |
| NVIDIA GPU | 6.557e-07 | 884M | 1.678e-07 | 242.62 |
| Interval | Implementation | MaxAbsErr | MaxULP | AvgAbsErr | AvgULP |
|---|---|---|---|---|---|
| [0.25, 0.5) | This work | 1.788e-07 | 3 | 6.779e-08 | 1.14 |
| NVIDIA GPU | 2.384e-07 | 4 | 4.643e-08 | 0.78 | |
| [0.5, 1) | This work | 2.980e-07 | 5 | 1.024e-07 | 1.72 |
| NVIDIA GPU | 2.980e-07 | 5 | 7.581e-08 | 1.27 | |
| [1, 2) | This work | 2.980e-07 | 8.7M | 1.017e-07 | 28.40 |
| NVIDIA GPU | 4.023e-07 | 875M | 1.188e-07 | 347.30 | |
| [2, 4) | This work | 2.980e-07 | 9 | 7.467e-08 | 1.33 |
| NVIDIA GPU | 4.172e-07 | 13 | 9.454e-08 | 1.72 |
Note on SIN/COS ULP near zeros: Large ULP values near
$\sin(x) = 0$ or$\cos(x) = 0$ are expected for the same reason as LOG2 near 1. The absolute errors remain small.
| Function | NVIDIA Spec | This Work (max over [0.5, 4)) | Status |
|---|---|---|---|
| EXP2 | Max 2 ULP | Max 2 ULP | Satisfied |
| LOG2 | Max |
Max 2.384e-07 ( |
Satisfied |
| RCP | Max 1 ULP | Max 1 ULP | Satisfied |
| SQRT | Max rel err |
Max rel err 1.192e-07 ( |
Satisfied |
| RSQRT | Max rel err |
Max rel err 1.192e-07 ( |
Satisfied |
- Chisel 6.6.0: Hardware description language
- Scala 2.13.15: Programming language for Chisel
- Mill: Build tool for Scala/Chisel projects
- Verilator: For simulation and verification
- CUDA/NVCC: For GPU-accelerated reference implementation (NVIDIA GPU required)
- Synopsys Design Compiler: For ASIC synthesis
Verilator-based testbench with:
- Test vector generation across full FP32 range
$[0.25, 4)$ in sub-intervals - Special value testing (NaN, Inf, zero, negative numbers, subnormals)
- ULP (Unit in Last Place) error measurement
- Waveform generation (FST format) for debugging
- ULP Error: Floating-point accuracy in units of the last place
- Relative Error: Percentage deviation from reference value
- Absolute Error: Absolute difference from reference value
[1] S. F. Oberman and M. Y. Siu, "A high-performance area-efficient multifunction interpolator," 17th IEEE Symposium on Computer Arithmetic (ARITH'05), Cape Cod, MA, USA, 2005, pp. 272–279, doi: 10.1109/ARITH.2005.7.