What is DFlash?
DFlash is a breakthrough speculative decoding technique introduced in the paper DFlash: Block Diffusion for Flash Speculative Decoding (Feb 2025).
Instead of traditional autoregressive draft models, it uses a lightweight block diffusion model (~5 layers) that generates an entire block of draft tokens in a single forward pass via parallel denoising. The draft model is conditioned on hidden/context features extracted from the target LLM (injected into the draft model's KV cache). The target LLM then verifies the whole block in parallel.
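The draft/verify loop described above can be sketched in a few lines. This is a toy illustration of block-drafting speculative decoding with greedy verification, not DFlash itself: `draft_block` and `target_next` are stand-in functions (a real drafter would produce the block via parallel denoising, and the target would score all positions in one batched forward pass rather than the sequential loop shown here).

```rust
/// Toy draft model: proposes a whole block of `k` tokens at once.
/// (In DFlash this would be a single parallel-denoising forward pass.)
fn draft_block(ctx: &[u32], k: usize) -> Vec<u32> {
    let mut last = *ctx.last().unwrap();
    (0..k).map(|_| { last += 1; last }).collect() // toy rule: last + 1
}

/// Toy target model: the ground-truth next token for a context.
fn target_next(ctx: &[u32]) -> u32 {
    let last = *ctx.last().unwrap();
    if last % 4 == 3 { 0 } else { last + 1 } // diverges from the draft every 4th token
}

/// One speculative step: draft a block, verify against the target, and
/// accept the longest matching prefix plus one corrected token. Because
/// every emitted token comes from the target, the output distribution is
/// unchanged -- this is what makes the scheme lossless.
fn speculative_step(ctx: &mut Vec<u32>, k: usize) -> usize {
    let block = draft_block(ctx, k);
    let mut accepted = 0;
    for &tok in &block {
        let t = target_next(ctx);
        ctx.push(t);
        accepted += 1;
        if t != tok { break; } // first mismatch: keep the target's token, stop
    }
    accepted
}

fn main() {
    let mut ctx = vec![1u32];
    let mut total = 0;
    while ctx.len() < 12 {
        total += speculative_step(&mut ctx, 4);
    }
    println!("generated {} tokens: {:?}", total, ctx);
}
```

The speedup comes from the verification side: one target forward pass can accept several draft tokens instead of one.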
Key results (lossless, distribution-preserving):
- >6× speedup on models like Qwen3.5, LLaMA-3.1 series across Math500, LiveCodeBench, GSM8K, etc.
- Up to 2.5× faster than the previous SOTA (EAGLE-3)
- Much higher acceptance rates and GPU utilization
Resources:
Why this fits RuVector perfectly
ruvLLM already ships state-of-the-art inference features:
- Speculative decoding (currently ~2-3× speedup, auto-detect draft models)
- FlashAttention-3 (and 50+ other attention mechanisms)
- Paged KV cache, continuous batching, quantization, Metal/CUDA/WebGPU/WASM backends
- Self-learning SONA + GNN system that learns from query feedback/trajectories
DFlash is the natural evolution of the existing speculative decoding path. A Rust-native block diffusion drafter would deliver massive additional gains, especially for:
- Agentic generation & routing
- Graph RAG + synthetic data flows
- Real-time local inference (edge/browser/postgres extension)
- Self-optimizing workloads (SONA could dynamically tune block size, conditioning features, or draft parameters)
Proposed implementation
- Add native support for DFlash-style block diffusion drafting inside `crates/ruvllm` and `ruvector-attention`
- Support loading pre-trained DFlash drafters (or convert from HF → GGUF/ONNX)
- Enable conditioning on target LLM hidden states (already possible with current KV/attention infrastructure)
- Make block size configurable + let SONA/GNN auto-adapt it from usage patterns
- Optional: expose new high-level APIs (e.g. `ruvector_dflash_generate_block(...)` in the PostgreSQL extension)
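To make the "configurable block size + SONA auto-adaptation" point concrete, here is one possible shape for the config surface. Everything in this sketch is hypothetical: `DflashConfig`, its fields, and the `adapt` feedback rule are illustrative names, not an existing ruvLLM API.

```rust
/// Hypothetical config for a DFlash-style drafter (illustrative only).
#[derive(Debug, Clone)]
struct DflashConfig {
    block_size: usize,         // draft tokens per denoising pass
    min_block: usize,
    max_block: usize,
    condition_on_hidden: bool, // inject target hidden states into the draft KV cache
}

impl DflashConfig {
    /// Simple feedback rule a SONA-style controller could apply:
    /// grow the block when most drafts are accepted, shrink it otherwise.
    fn adapt(&mut self, acceptance_rate: f32) {
        if acceptance_rate > 0.8 && self.block_size < self.max_block {
            self.block_size += 1;
        } else if acceptance_rate < 0.4 && self.block_size > self.min_block {
            self.block_size -= 1;
        }
    }
}

fn main() {
    let mut cfg = DflashConfig {
        block_size: 8,
        min_block: 2,
        max_block: 32,
        condition_on_hidden: true,
    };
    cfg.adapt(0.9); // high acceptance rate -> larger blocks
    println!("{:?}", cfg);
}
```

A bounded step-up/step-down rule like this is the simplest starting point; SONA/GNN feedback could later replace it with something learned from per-workload acceptance trajectories.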
Expected impact
- Significantly higher throughput and lower latency while staying 100% lossless
- Strengthens RuVector’s position as the fastest self-learning local AI memory + inference engine
- Keeps us ahead of llama.cpp / vLLM / Ollama in the speculative decoding space
Happy to help test, benchmark, or even contribute code/PRs once the direction is clear. Let me know how I can support the implementation!