Yet Another FFT Library – Experimental SIMD FFT Playground

This repository contains an experimental, SIMD-accelerated 1-D FFT implementation in modern C++17, plus a small set of tests and benchmarks that compare it against libraries like FFTW, Intel IPP, Eigen’s FFT, and PFFFT.

It’s very much a personal side project for learning and performance exploration: nothing here should be considered production-ready.

Features at a Glance

C++17 FFT library in fft/ built as a static library fft
Single-precision (float) 1-D transforms
- Complex → complex
- Real → complex and complex → real
Plan-based API (similar in spirit to FFTW):
- fft_plan* generate_complex_plan(uint32_t N);
- real_plan* generate_real_plan(uint32_t N);
Supported sizes
- Complex: power-of-two sizes and primes (via Rader’s algorithm)
- Real: power-of-two sizes
SIMD-friendly layout
- Aligned, packed data types: packed_complex_ps and packed_real_ps
- Uses SIMDe headers bundled under fft/common/simde to access AVX/AVX2/AVX-512/NEON style intrinsics
Benchmarks
- flops_benchmark: approximate GFLOPS vs FFTW / IPP / PFFFT / Eigen
- nanobench: micro-benchmarks using ankerl::nanobench
- Performance of FFT Algorithms in GFLOPS.png: example plot of benchmark results
Correctness test harness
- test: numerical comparison vs Eigen’s FFT

Again, this is a learning playground, not a polished, general-purpose library.

Dependencies

You’ll need:

CMake ≥ 3.22
A C++17 compiler (GCC, Clang, or IntelLLVM should work)
Eigen 3.4+
- Found via find_package(Eigen3 3.4 REQUIRED NO_MODULE)
FFTW3 (single precision)
- Found via find_package(FFTW REQUIRED)
- The repo ships a custom FindFFTW.cmake under other_implementations/findFFTW/
Intel IPP (for IPP benchmark paths)
- CMake looks for it via find_package(IPP HINTS /opt/intel/oneapi/ipp/2021.10/lib/cmake/ipp)
- If you don’t have IPP installed, you can either install it or disable/comment out the IPP-dependent executables in CMakeLists.txt
A POSIX-like environment is assumed for some timing code (sysconf, times, etc.) and __rdtsc() is used for cycle timing.

The core fft library itself only depends on the standard library and the bundled SIMDe headers; Eigen, FFTW, and IPP are only needed for tests and benchmarks.

Building

From the repository root:

# Configure (Release build is where performance starts to matter)
cmake -B build -DCMAKE_BUILD_TYPE=Release

# Or for debugging
cmake -B build -DCMAKE_BUILD_TYPE=Debug

# Build everything
cmake --build build -j

This will produce:

build/libfft.a – the FFT library
build/test – correctness test
build/flops_benchmark – FLOPS benchmark
build/nanobench – micro-benchmark

For performance work, use Release so that the vectorized code paths and LTO flags are enabled (-O3 -march=native -flto=thin, etc.).

Public API Overview

Include the top-level header:

#include "fft/fft.h"

The main entry points (see fft/fft.h):

fft_plan* generate_complex_plan(uint32_t N);
real_plan* generate_real_plan(uint32_t N);

void fft_fwd_real(const real_plan* p,
                  const packed_real_ps& input,
                  packed_complex_ps& output);

void fft_inv_real(const real_plan* p,
                  const packed_complex_ps& input,
                  packed_real_ps& output);

void fft_fwd_complex(const fft_plan* p,
                     const packed_complex_ps& input,
                     packed_complex_ps& output);

void fft_inv_complex(const fft_plan* p,
                     const packed_complex_ps& input,
                     packed_complex_ps& output);

Data containers

The SIMD-friendly containers live in fft/common/types.h and are pulled in by fft/fft.h:

packed_complex_ps
packed_real_ps (inherits from packed_complex_ps)

Constructors you’re likely to use:

// Complex data: wrap an existing std::complex<float> array
packed_complex_ps packed(N, complex_ptr);

// Or allocate an aligned buffer of N complex floats
packed_complex_ps packed(N);

// Real data: wrap an existing float array of length N
packed_real_ps packed_real(N, float_ptr);

// Or allocate an aligned buffer for a real transform of length N
packed_real_ps packed_real(N);

Both packed_complex_ps and packed_real_ps have unpack_to(...) methods to materialize results back into standard containers.

Example Usage

Complex FFT

#include <vector>
#include <complex>
#include "fft/fft.h"

int main() {
    uint32_t N = 1u << 20; // power of two, required for now

    std::vector<std::complex<float>> data(N);
    // fill data...

    // Wrap in packed container
    packed_complex_ps in(N, data.data());
    packed_complex_ps out(N);

    // Build a plan (supports power-of-two and prime N)
    fft_plan* plan = generate_complex_plan(N);

    // Forward transform
    fft_fwd_complex(plan, in, out);

    // Backward transform (in-place or to a fresh buffer)
    packed_complex_ps recovered(N);
    fft_inv_complex(plan, out, recovered);

    recovered.unpack_to(data.data());

    delete plan;
    return 0;
}

Real FFT

#include <vector>
#include "fft/fft.h"

int main() {
    uint32_t N = 1u << 20; // must be a power of two

    std::vector<float> time_domain(N);
    // fill time_domain...

    packed_real_ps in(N, time_domain.data());
    packed_complex_ps freq(N / 2); // internally uses N/2 complex numbers

    real_plan* plan = generate_real_plan(N);

    fft_fwd_real(plan, in, freq);
    // ... process freq ...

    packed_real_ps recovered(N);
    fft_inv_real(plan, freq, recovered);
    recovered.unpack_to(time_domain.data());

    delete plan;
    return 0;
}

Running Tests and Benchmarks

Assuming you’ve built in build/:

Correctness test

./build/test

This program compares the implementation against Eigen’s FFT for a selection of sizes and reports relative error.

FLOPS benchmark

./build/flops_benchmark

This executable sweeps over a list of sizes and prints a table of approximate GFLOPS for:

This implementation
FFTW
PFFFT
Eigen FFT
Intel IPP (if available)

The included Performance of FFT Algorithms in GFLOPS.png is produced from runs of this benchmark.

Micro-benchmarks

./build/nanobench

Uses ankerl::nanobench to time individual kernels / transforms at very small granularities, with cache flushing between runs.

Repository Layout

CMakeLists.txt – top-level build configuration
fft/
- fft.h, fft.cpp – public API and plan selection (power-of-two vs prime sizes)
- common/
  - types.h – SIMD-friendly data types, alignment helpers, etc.
  - number_theoretic.* – number-theoretic helpers (Miller–Rabin, primitive roots, CRT helpers)
  - simde/ – vendored SIMDe headers for cross-platform SIMD
- core/
  - kernels.* – base radix-2 / radix-4 scalar kernels
  - vector_kernels.* – AVX-style vectorized kernels
  - fft_pow2.* – FFT plan generation and execution for power-of-two sizes
  - thread_pool.h – placeholder for future parallelization
- prime/
  - fft_rader.* – Rader’s algorithm for prime-length FFTs
flops_benchmark.cpp – GFLOPS benchmark vs FFTW / IPP / Eigen / PFFFT
nanobenchmark.cpp – micro-benchmark harness
test.cpp – correctness tests vs Eigen FFT
other_implementations/
- pffft.[ch] – PFFFT reference implementation
- nanobench.h – single-header benchmark library
- VariadicTable.h – pretty-printing tables
- findFFTW/FindFFTW.cmake – FFTW CMake find script
Performance of FFT Algorithms in GFLOPS.png – example benchmark plot

Limitations and Caveats

Since this is a side project, there are some limitations:

Complex plans: currently assert that N is either prime or a power of two.
Real plans: N must be a power of two.
Dimensionality: only 1-D transforms are implemented.
Threading: there’s a placeholder thread_pool.h, but transforms are effectively single-threaded at the moment.
Portability: the code is intended to be portable via SIMDe, but hasn't been tested beyond an M1 and a Ryzen 7950x

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
fft		fft
other_implementations		other_implementations
.clang-format		.clang-format
CMakeLists.txt		CMakeLists.txt
README.md		README.md
flops_benchmark.cpp		flops_benchmark.cpp
main.cpp		main.cpp
nanobenchmark.cpp		nanobenchmark.cpp
perf_plot.png		perf_plot.png
test.cpp		test.cpp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Yet Another FFT Library – Experimental SIMD FFT Playground

Features at a Glance

Dependencies

Building

Public API Overview

Data containers

Example Usage

Complex FFT

Real FFT

Running Tests and Benchmarks

Correctness test

FLOPS benchmark

Micro-benchmarks

Repository Layout

Limitations and Caveats

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Yet Another FFT Library – Experimental SIMD FFT Playground

Features at a Glance

Dependencies

Building

Public API Overview

Data containers

Example Usage

Complex FFT

Real FFT

Running Tests and Benchmarks

Correctness test

FLOPS benchmark

Micro-benchmarks

Repository Layout

Limitations and Caveats

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages