
Feature Request: Integrate DFlash Block Diffusion Speculative Decoding #341

@shaal

Description


What is DFlash?

DFlash is a breakthrough speculative decoding technique introduced in the paper DFlash: Block Diffusion for Flash Speculative Decoding (Feb 2025).

Instead of a traditional autoregressive draft model, it uses a lightweight block diffusion model (~5 layers) that generates an entire block of draft tokens in a single forward pass via parallel denoising. The draft model is conditioned on hidden/context features extracted from the target LLM (injected into the draft model's KV cache), and the target LLM then verifies the whole block in parallel.
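To make the draft-then-verify flow concrete, here is a minimal, self-contained Rust sketch of the control loop described above. Both models are toy stand-ins, and none of these types or function names exist in ruvLLM or the DFlash paper; the point is only the shape of the algorithm: draft a whole block at once, verify every position in parallel against the target, keep the longest accepted prefix.

```rust
/// Toy drafter: proposes a whole block of `k` draft tokens in one pass,
/// standing in for the block diffusion model's parallel denoising step.
fn draft_block(context: &[u32], k: usize) -> Vec<u32> {
    let last = context.last().copied().unwrap_or(0);
    (1..=k as u32).map(|i| last + i).collect()
}

/// Toy target model: the token the target LLM would emit next.
/// (Here: strictly increasing token ids, so drafts always match.)
fn target_next(context: &[u32]) -> u32 {
    context.last().copied().unwrap_or(0) + 1
}

/// One speculative step: draft `k` tokens, check each against the target's
/// prediction, and accept the longest matching prefix. On the first
/// mismatch we append the target's correction, so progress is always >= 1.
fn speculative_step(context: &mut Vec<u32>, k: usize) -> usize {
    let draft = draft_block(context, k);
    let mut accepted = 0;
    for &tok in &draft {
        let expected = target_next(context);
        if tok == expected {
            context.push(tok);
            accepted += 1;
        } else {
            context.push(expected); // target's correction replaces the bad draft
            return accepted + 1;
        }
    }
    accepted
}

fn main() {
    let mut ctx = vec![0u32];
    let produced = speculative_step(&mut ctx, 4);
    println!("accepted {produced} tokens, context = {ctx:?}");
}
```

In the real scheme the per-position checks are probabilistic (distribution-preserving acceptance) rather than exact token matches, and verification is a single batched forward pass through the target model; this sketch collapses both into a simple loop.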

Key results (lossless, distribution-preserving):

  • >6× speedup on models such as Qwen3.5 and the LLaMA-3.1 series across Math500, LiveCodeBench, GSM8K, etc.
  • Up to 2.5× faster than the previous SOTA (EAGLE-3)
  • Much higher acceptance rates and GPU utilization

Why this fits RuVector perfectly

ruvLLM already ships state-of-the-art inference features:

  • Speculative decoding (currently ~2-3× speedup, auto-detect draft models)
  • FlashAttention-3 (and 50+ other attention mechanisms)
  • Paged KV cache, continuous batching, quantization, Metal/CUDA/WebGPU/WASM backends
  • Self-learning SONA + GNN system that learns from query feedback/trajectories

DFlash is the natural evolution of the existing speculative decoding path. A Rust-native block diffusion drafter would deliver substantial additional gains, especially for:

  • Agentic generation & routing
  • Graph RAG + synthetic data flows
  • Real-time local inference (edge/browser/postgres extension)
  • Self-optimizing workloads (SONA could dynamically tune block size, conditioning features, or draft parameters)

Proposed implementation

  1. Add native support for DFlash-style block diffusion drafting inside crates/ruvllm and ruvector-attention
  2. Support loading pre-trained DFlash drafters (or convert from HF → GGUF/ONNX)
  3. Enable conditioning on target LLM hidden states (already possible with current KV/attention infrastructure)
  4. Make block size configurable + let SONA/GNN auto-adapt it from usage patterns
  5. Optional: expose new high-level APIs (e.g. ruvector_dflash_generate_block(...) in the PostgreSQL extension)
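For item 4, a possible shape for the configuration and the SONA-driven adaptation is sketched below. Every name here (`DFlashConfig`, `adapt`, the field names, and the 0.8/0.4 thresholds) is a hypothetical assumption for illustration, not an existing ruvLLM API: grow the block when acceptance is high, shrink it when too much verification work is being wasted.

```rust
/// Hypothetical DFlash drafter configuration (illustrative only).
#[derive(Debug)]
struct DFlashConfig {
    block_size: usize,      // draft tokens generated per denoising pass
    min_block: usize,       // lower bound for adaptation
    max_block: usize,       // upper bound for adaptation
    condition_layer: usize, // which target hidden layer feeds the drafter
}

impl DFlashConfig {
    /// SONA-style feedback rule (assumed thresholds): high acceptance means
    /// drafts are cheap and accurate, so draft more per pass; low acceptance
    /// means verification work is being wasted, so draft less.
    fn adapt(&mut self, acceptance_rate: f64) {
        if acceptance_rate > 0.8 && self.block_size < self.max_block {
            self.block_size += 1;
        } else if acceptance_rate < 0.4 && self.block_size > self.min_block {
            self.block_size -= 1;
        }
    }
}

fn main() {
    let mut cfg = DFlashConfig {
        block_size: 8,
        min_block: 2,
        max_block: 16,
        condition_layer: 20,
    };
    cfg.adapt(0.95); // high acceptance: grow the draft block
    cfg.adapt(0.30); // low acceptance: shrink it back
    println!("{cfg:?}");
}
```

In practice SONA could learn this policy from query trajectories rather than fixed thresholds; the point is that block size is a single scalar knob with a clear feedback signal (acceptance rate), making it a natural first target for auto-tuning.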

Expected impact

  • Significantly higher throughput and lower latency while staying 100% lossless
  • Strengthens RuVector’s position as the fastest self-learning local AI memory + inference engine
  • Keeps us ahead of llama.cpp / vLLM / Ollama in the speculative decoding space

Happy to help test, benchmark, or even contribute code/PRs once the direction is clear. Let me know how I can support the implementation!
