This repository contains an experimental, SIMD-accelerated 1-D FFT implementation in modern C++17, plus a small set of tests and benchmarks that compare it against libraries like FFTW, Intel IPP, Eigen’s FFT, and PFFFT.
It’s very much a personal side project for learning and performance exploration: nothing here should be considered production-ready.
-
C++17 FFT library in
fft/built as a static libraryfft -
Single-precision (float) 1-D transforms
- Complex → complex
- Real → complex and complex → real
-
Plan-based API (similar in spirit to FFTW):
fft_plan* generate_complex_plan(uint32_t N);real_plan* generate_real_plan(uint32_t N);
-
Supported sizes
- Complex: power-of-two sizes and primes (via Rader’s algorithm)
- Real: power-of-two sizes
-
SIMD-friendly layout
- Aligned, packed data types:
packed_complex_psandpacked_real_ps - Uses SIMDe headers bundled under
fft/common/simdeto access AVX/AVX2/AVX-512/NEON style intrinsics
- Aligned, packed data types:
-
Benchmarks
flops_benchmark: approximate GFLOPS vs FFTW / IPP / PFFFT / Eigennanobench: micro-benchmarks usingankerl::nanobenchPerformance of FFT Algorithms in GFLOPS.png: example plot of benchmark results
-
Correctness test harness
test: numerical comparison vs Eigen’s FFT
Again, this is a learning playground, not a polished, general-purpose library.
You’ll need:
-
CMake ≥ 3.22
-
A C++17 compiler (GCC, Clang, or IntelLLVM should work)
-
Eigen 3.4+
- Found via
find_package(Eigen3 3.4 REQUIRED NO_MODULE)
- Found via
-
FFTW3 (single precision)
- Found via
find_package(FFTW REQUIRED) - The repo ships a custom
FindFFTW.cmakeunderother_implementations/findFFTW/
- Found via
-
Intel IPP (for IPP benchmark paths)
- CMake looks for it via
find_package(IPP HINTS /opt/intel/oneapi/ipp/2021.10/lib/cmake/ipp) - If you don’t have IPP installed, you can either install it or disable/comment out the IPP-dependent executables in
CMakeLists.txt
- CMake looks for it via
-
A POSIX-like environment is assumed for some timing code (
sysconf,times, etc.) and__rdtsc()is used for cycle timing.
The core fft library itself only depends on the standard library and the bundled SIMDe headers; Eigen, FFTW, and IPP are only needed for tests and benchmarks.
From the repository root:
# Configure (Release build is where performance starts to matter)
cmake -B build -DCMAKE_BUILD_TYPE=Release
# Or for debugging
cmake -B build -DCMAKE_BUILD_TYPE=Debug
# Build everything
cmake --build build -jThis will produce:
build/libfft.a– the FFT librarybuild/test– correctness testbuild/flops_benchmark– FLOPS benchmarkbuild/nanobench– micro-benchmark
For performance work, use Release so that the vectorized code paths and LTO flags are enabled (-O3 -march=native -flto=thin, etc.).
Include the top-level header:
#include "fft/fft.h"The main entry points (see fft/fft.h):
fft_plan* generate_complex_plan(uint32_t N);
real_plan* generate_real_plan(uint32_t N);
void fft_fwd_real(const real_plan* p,
const packed_real_ps& input,
packed_complex_ps& output);
void fft_inv_real(const real_plan* p,
const packed_complex_ps& input,
packed_real_ps& output);
void fft_fwd_complex(const fft_plan* p,
const packed_complex_ps& input,
packed_complex_ps& output);
void fft_inv_complex(const fft_plan* p,
const packed_complex_ps& input,
packed_complex_ps& output);The SIMD-friendly containers live in fft/common/types.h and are pulled in by fft/fft.h:
packed_complex_pspacked_real_ps(inherits frompacked_complex_ps)
Constructors you’re likely to use:
// Complex data: wrap an existing std::complex<float> array
packed_complex_ps packed(N, complex_ptr);
// Or allocate an aligned buffer of N complex floats
packed_complex_ps packed(N);
// Real data: wrap an existing float array of length N
packed_real_ps packed_real(N, float_ptr);
// Or allocate an aligned buffer for a real transform of length N
packed_real_ps packed_real(N);Both packed_complex_ps and packed_real_ps have unpack_to(...) methods to materialize results back into standard containers.
#include <vector>
#include <complex>
#include "fft/fft.h"
int main() {
uint32_t N = 1u << 20; // power of two, required for now
std::vector<std::complex<float>> data(N);
// fill data...
// Wrap in packed container
packed_complex_ps in(N, data.data());
packed_complex_ps out(N);
// Build a plan (supports power-of-two and prime N)
fft_plan* plan = generate_complex_plan(N);
// Forward transform
fft_fwd_complex(plan, in, out);
// Backward transform (in-place or to a fresh buffer)
packed_complex_ps recovered(N);
fft_inv_complex(plan, out, recovered);
recovered.unpack_to(data.data());
delete plan;
return 0;
}#include <vector>
#include "fft/fft.h"
int main() {
uint32_t N = 1u << 20; // must be a power of two
std::vector<float> time_domain(N);
// fill time_domain...
packed_real_ps in(N, time_domain.data());
packed_complex_ps freq(N / 2); // internally uses N/2 complex numbers
real_plan* plan = generate_real_plan(N);
fft_fwd_real(plan, in, freq);
// ... process freq ...
packed_real_ps recovered(N);
fft_inv_real(plan, freq, recovered);
recovered.unpack_to(time_domain.data());
delete plan;
return 0;
}Assuming you’ve built in build/:
./build/testThis program compares the implementation against Eigen’s FFT for a selection of sizes and reports relative error.
./build/flops_benchmarkThis executable sweeps over a list of sizes and prints a table of approximate GFLOPS for:
- This implementation
- FFTW
- PFFFT
- Eigen FFT
- Intel IPP (if available)
The included Performance of FFT Algorithms in GFLOPS.png is produced from runs of this benchmark.
./build/nanobenchUses ankerl::nanobench to time individual kernels / transforms at very small granularities, with cache flushing between runs.
-
CMakeLists.txt– top-level build configuration -
fft/-
fft.h,fft.cpp– public API and plan selection (power-of-two vs prime sizes) -
common/types.h– SIMD-friendly data types, alignment helpers, etc.number_theoretic.*– number-theoretic helpers (Miller–Rabin, primitive roots, CRT helpers)simde/– vendored SIMDe headers for cross-platform SIMD
-
core/kernels.*– base radix-2 / radix-4 scalar kernelsvector_kernels.*– AVX-style vectorized kernelsfft_pow2.*– FFT plan generation and execution for power-of-two sizesthread_pool.h– placeholder for future parallelization
-
prime/fft_rader.*– Rader’s algorithm for prime-length FFTs
-
-
flops_benchmark.cpp– GFLOPS benchmark vs FFTW / IPP / Eigen / PFFFT -
nanobenchmark.cpp– micro-benchmark harness -
test.cpp– correctness tests vs Eigen FFT -
other_implementations/pffft.[ch]– PFFFT reference implementationnanobench.h– single-header benchmark libraryVariadicTable.h– pretty-printing tablesfindFFTW/FindFFTW.cmake– FFTW CMake find script
-
Performance of FFT Algorithms in GFLOPS.png– example benchmark plot
Since this is a side project, there are some limitations:
- Complex plans: currently assert that
Nis either prime or a power of two. - Real plans:
Nmust be a power of two. - Dimensionality: only 1-D transforms are implemented.
- Threading: there’s a placeholder
thread_pool.h, but transforms are effectively single-threaded at the moment. - Portability: the code is intended to be portable via SIMDe, but hasn't been tested beyond an M1 and a Ryzen 7950x