
Feature Request: Integrate DFlash Block Diffusion Speculative Decoding #341

@shaal

Description


What is DFlash?

DFlash is a breakthrough speculative decoding technique introduced in the paper DFlash: Block Diffusion for Flash Speculative Decoding (Feb 2025).

Instead of a traditional autoregressive draft model, it uses a lightweight block diffusion model (~5 layers) that generates an entire block of draft tokens in a single forward pass via parallel denoising. The draft model is conditioned on hidden/context features extracted from the target LLM (injected into the draft model's KV cache), and the target LLM then verifies the whole block in parallel.
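To make the draft-then-verify flow concrete, here is a minimal, self-contained Rust sketch of the control loop described above. Both models are toy stand-ins, and none of these types or function names exist in ruvLLM or the DFlash paper; the point is only the shape of the algorithm: draft a whole block at once, verify every position in parallel against the target, keep the longest accepted prefix.

```rust
/// Toy drafter: proposes a whole block of `k` draft tokens in one pass,
/// standing in for the block diffusion model's parallel denoising step.
fn draft_block(context: &[u32], k: usize) -> Vec<u32> {
    let last = context.last().copied().unwrap_or(0);
    (1..=k as u32).map(|i| last + i).collect()
}

/// Toy target model: the token the target LLM would emit next.
/// (Here: strictly increasing token ids, so drafts always match.)
fn target_next(context: &[u32]) -> u32 {
    context.last().copied().unwrap_or(0) + 1
}

/// One speculative step: draft `k` tokens, check each against the target's
/// prediction, and accept the longest matching prefix. On the first
/// mismatch we append the target's correction, so progress is always >= 1.
fn speculative_step(context: &mut Vec<u32>, k: usize) -> usize {
    let draft = draft_block(context, k);
    let mut accepted = 0;
    for &tok in &draft {
        let expected = target_next(context);
        if tok == expected {
            context.push(tok);
            accepted += 1;
        } else {
            context.push(expected); // target's correction replaces the bad draft
            return accepted + 1;
        }
    }
    accepted
}

fn main() {
    let mut ctx = vec![0u32];
    let produced = speculative_step(&mut ctx, 4);
    println!("accepted {produced} tokens, context = {ctx:?}");
}
```

In the real scheme the per-position checks are probabilistic (distribution-preserving acceptance) rather than exact token matches, and verification is a single batched forward pass through the target model; this sketch collapses both into a simple loop.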

Key results (lossless, distribution-preserving):

  • >6× speedup on models such as Qwen3.5 and the LLaMA-3.1 series across Math500, LiveCodeBench, GSM8K, etc.
  • Up to 2.5× faster than the previous SOTA (EAGLE-3)
  • Much higher acceptance rates and GPU utilization

Why this fits RuVector perfectly

ruvLLM already ships state-of-the-art inference features:

  • Speculative decoding (currently ~2-3× speedup, auto-detect draft models)
  • FlashAttention-3 (and 50+ other attention mechanisms)
  • Paged KV cache, continuous batching, quantization, Metal/CUDA/WebGPU/WASM backends
  • Self-learning SONA + GNN system that learns from query feedback/trajectories

DFlash is the natural evolution of the existing speculative decoding path. A Rust-native block diffusion drafter would deliver substantial additional gains, especially for:

  • Agentic generation & routing
  • Graph RAG + synthetic data flows
  • Real-time local inference (edge/browser/postgres extension)
  • Self-optimizing workloads (SONA could dynamically tune block size, conditioning features, or draft parameters)

Proposed implementation

  1. Add native support for DFlash-style block diffusion drafting inside crates/ruvllm and ruvector-attention
  2. Support loading pre-trained DFlash drafters (or convert from HF → GGUF/ONNX)
  3. Enable conditioning on target LLM hidden states (already possible with current KV/attention infrastructure)
  4. Make block size configurable + let SONA/GNN auto-adapt it from usage patterns
  5. Optional: expose new high-level APIs (e.g. ruvector_dflash_generate_block(...) in the PostgreSQL extension)
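For item 4, a possible shape for the configuration and the SONA-driven adaptation is sketched below. Every name here (`DFlashConfig`, `adapt`, the field names, and the 0.8/0.4 thresholds) is a hypothetical assumption for illustration, not an existing ruvLLM API: grow the block when acceptance is high, shrink it when too much verification work is being wasted.

```rust
/// Hypothetical DFlash drafter configuration (illustrative only).
#[derive(Debug)]
struct DFlashConfig {
    block_size: usize,      // draft tokens generated per denoising pass
    min_block: usize,       // lower bound for adaptation
    max_block: usize,       // upper bound for adaptation
    condition_layer: usize, // which target hidden layer feeds the drafter
}

impl DFlashConfig {
    /// SONA-style feedback rule (assumed thresholds): high acceptance means
    /// drafts are cheap and accurate, so draft more per pass; low acceptance
    /// means verification work is being wasted, so draft less.
    fn adapt(&mut self, acceptance_rate: f64) {
        if acceptance_rate > 0.8 && self.block_size < self.max_block {
            self.block_size += 1;
        } else if acceptance_rate < 0.4 && self.block_size > self.min_block {
            self.block_size -= 1;
        }
    }
}

fn main() {
    let mut cfg = DFlashConfig {
        block_size: 8,
        min_block: 2,
        max_block: 16,
        condition_layer: 20,
    };
    cfg.adapt(0.95); // high acceptance: grow the draft block
    cfg.adapt(0.30); // low acceptance: shrink it back
    println!("{cfg:?}");
}
```

In practice SONA could learn this policy from query trajectories rather than fixed thresholds; the point is that block size is a single scalar knob with a clear feedback signal (acceptance rate), making it a natural first target for auto-tuning.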

Expected impact

  • Significantly higher throughput and lower latency while staying 100% lossless
  • Strengthens RuVector’s position as the fastest self-learning local AI memory + inference engine
  • Keeps us ahead of llama.cpp / vLLM / Ollama in the speculative decoding space

Happy to help test, benchmark, or even contribute code/PRs once the direction is clear. Let me know how I can support the implementation!
