Skip to content

vnchari/yafft

Repository files navigation

Yet Another FFT Library – Experimental SIMD FFT Playground

perf_plot.png This repository contains an experimental, SIMD-accelerated 1-D FFT implementation in modern C++17, plus a small set of tests and benchmarks that compare it against libraries like FFTW, Intel IPP, Eigen’s FFT, and PFFFT.

It’s very much a personal side project for learning and performance exploration: nothing here should be considered production-ready.


Features at a Glance

  • C++17 FFT library in fft/ built as a static library fft

  • Single-precision (float) 1-D transforms

    • Complex → complex
    • Real → complex and complex → real
  • Plan-based API (similar in spirit to FFTW):

    • fft_plan* generate_complex_plan(uint32_t N);
    • real_plan* generate_real_plan(uint32_t N);
  • Supported sizes

    • Complex: power-of-two sizes and primes (via Rader’s algorithm)
    • Real: power-of-two sizes
  • SIMD-friendly layout

    • Aligned, packed data types: packed_complex_ps and packed_real_ps
    • Uses SIMDe headers bundled under fft/common/simde to access AVX/AVX2/AVX-512/NEON style intrinsics
  • Benchmarks

    • flops_benchmark: approximate GFLOPS vs FFTW / IPP / PFFFT / Eigen
    • nanobench: micro-benchmarks using ankerl::nanobench
    • Performance of FFT Algorithms in GFLOPS.png: example plot of benchmark results
  • Correctness test harness

    • test: numerical comparison vs Eigen’s FFT

Again, this is a learning playground, not a polished, general-purpose library.


Dependencies

You’ll need:

  • CMake ≥ 3.22

  • A C++17 compiler (GCC, Clang, or IntelLLVM should work)

  • Eigen 3.4+

    • Found via find_package(Eigen3 3.4 REQUIRED NO_MODULE)
  • FFTW3 (single precision)

    • Found via find_package(FFTW REQUIRED)
    • The repo ships a custom FindFFTW.cmake under other_implementations/findFFTW/
  • Intel IPP (for IPP benchmark paths)

    • CMake looks for it via find_package(IPP HINTS /opt/intel/oneapi/ipp/2021.10/lib/cmake/ipp)
    • If you don’t have IPP installed, you can either install it or disable/comment out the IPP-dependent executables in CMakeLists.txt
  • A POSIX-like environment is assumed for some timing code (sysconf, times, etc.) and __rdtsc() is used for cycle timing.

The core fft library itself only depends on the standard library and the bundled SIMDe headers; Eigen, FFTW, and IPP are only needed for tests and benchmarks.


Building

From the repository root:

# Configure (Release build is where performance starts to matter)
cmake -B build -DCMAKE_BUILD_TYPE=Release

# Or for debugging
cmake -B build -DCMAKE_BUILD_TYPE=Debug

# Build everything
cmake --build build -j

This will produce:

  • build/libfft.a – the FFT library
  • build/test – correctness test
  • build/flops_benchmark – FLOPS benchmark
  • build/nanobench – micro-benchmark

For performance work, use Release so that the vectorized code paths and LTO flags are enabled (-O3 -march=native -flto=thin, etc.).


Public API Overview

Include the top-level header:

#include "fft/fft.h"

The main entry points (see fft/fft.h):

fft_plan* generate_complex_plan(uint32_t N);
real_plan* generate_real_plan(uint32_t N);

void fft_fwd_real(const real_plan* p,
                  const packed_real_ps& input,
                  packed_complex_ps& output);

void fft_inv_real(const real_plan* p,
                  const packed_complex_ps& input,
                  packed_real_ps& output);

void fft_fwd_complex(const fft_plan* p,
                     const packed_complex_ps& input,
                     packed_complex_ps& output);

void fft_inv_complex(const fft_plan* p,
                     const packed_complex_ps& input,
                     packed_complex_ps& output);

Data containers

The SIMD-friendly containers live in fft/common/types.h and are pulled in by fft/fft.h:

  • packed_complex_ps
  • packed_real_ps (inherits from packed_complex_ps)

Constructors you’re likely to use:

// Complex data: wrap an existing std::complex<float> array
packed_complex_ps packed(N, complex_ptr);

// Or allocate an aligned buffer of N complex floats
packed_complex_ps packed(N);

// Real data: wrap an existing float array of length N
packed_real_ps packed_real(N, float_ptr);

// Or allocate an aligned buffer for a real transform of length N
packed_real_ps packed_real(N);

Both packed_complex_ps and packed_real_ps have unpack_to(...) methods to materialize results back into standard containers.


Example Usage

Complex FFT

#include <vector>
#include <complex>
#include "fft/fft.h"

int main() {
    uint32_t N = 1u << 20; // power of two, required for now

    std::vector<std::complex<float>> data(N);
    // fill data...

    // Wrap in packed container
    packed_complex_ps in(N, data.data());
    packed_complex_ps out(N);

    // Build a plan (supports power-of-two and prime N)
    fft_plan* plan = generate_complex_plan(N);

    // Forward transform
    fft_fwd_complex(plan, in, out);

    // Backward transform (in-place or to a fresh buffer)
    packed_complex_ps recovered(N);
    fft_inv_complex(plan, out, recovered);

    recovered.unpack_to(data.data());

    delete plan;
    return 0;
}

Real FFT

#include <vector>
#include "fft/fft.h"

int main() {
    uint32_t N = 1u << 20; // must be a power of two

    std::vector<float> time_domain(N);
    // fill time_domain...

    packed_real_ps in(N, time_domain.data());
    packed_complex_ps freq(N / 2); // internally uses N/2 complex numbers

    real_plan* plan = generate_real_plan(N);

    fft_fwd_real(plan, in, freq);
    // ... process freq ...

    packed_real_ps recovered(N);
    fft_inv_real(plan, freq, recovered);
    recovered.unpack_to(time_domain.data());

    delete plan;
    return 0;
}

Running Tests and Benchmarks

Assuming you’ve built in build/:

Correctness test

./build/test

This program compares the implementation against Eigen’s FFT for a selection of sizes and reports relative error.

FLOPS benchmark

./build/flops_benchmark

This executable sweeps over a list of sizes and prints a table of approximate GFLOPS for:

  • This implementation
  • FFTW
  • PFFFT
  • Eigen FFT
  • Intel IPP (if available)

The included Performance of FFT Algorithms in GFLOPS.png is produced from runs of this benchmark.

Micro-benchmarks

./build/nanobench

Uses ankerl::nanobench to time individual kernels / transforms at very small granularities, with cache flushing between runs.


Repository Layout

  • CMakeLists.txt – top-level build configuration

  • fft/

    • fft.h, fft.cpp – public API and plan selection (power-of-two vs prime sizes)

    • common/

      • types.h – SIMD-friendly data types, alignment helpers, etc.
      • number_theoretic.* – number-theoretic helpers (Miller–Rabin, primitive roots, CRT helpers)
      • simde/ – vendored SIMDe headers for cross-platform SIMD
    • core/

      • kernels.* – base radix-2 / radix-4 scalar kernels
      • vector_kernels.* – AVX-style vectorized kernels
      • fft_pow2.* – FFT plan generation and execution for power-of-two sizes
      • thread_pool.h – placeholder for future parallelization
    • prime/

      • fft_rader.* – Rader’s algorithm for prime-length FFTs
  • flops_benchmark.cpp – GFLOPS benchmark vs FFTW / IPP / Eigen / PFFFT

  • nanobenchmark.cpp – micro-benchmark harness

  • test.cpp – correctness tests vs Eigen FFT

  • other_implementations/

    • pffft.[ch] – PFFFT reference implementation
    • nanobench.h – single-header benchmark library
    • VariadicTable.h – pretty-printing tables
    • findFFTW/FindFFTW.cmake – FFTW CMake find script
  • Performance of FFT Algorithms in GFLOPS.png – example benchmark plot


Limitations and Caveats

Since this is a side project, there are some limitations:

  • Complex plans: currently assert that N is either prime or a power of two.
  • Real plans: N must be a power of two.
  • Dimensionality: only 1-D transforms are implemented.
  • Threading: there’s a placeholder thread_pool.h, but transforms are effectively single-threaded at the moment.
  • Portability: the code is intended to be portable via SIMDe, but hasn't been tested beyond an M1 and a Ryzen 7950x

About

Yet Another FFT Library – Some experiments with SIMD Fast Fourier Transforms

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages