Making Self-Hosted AI Faster: Speculative Decoding and dFlash
Self-hosted LLMs have a quiet performance problem, and it is not the one most people expect. Raw GPU compute is rarely the bottleneck. During text generation the GPU spends most of its time waiting on memory: it streams the model weights and the key-value cache through memory for every token, and it emits those tokens one at a time, strictly in sequence. Meanwhile the compute units sit mostly idle. This is the decode bottleneck, and it is why a card that looks fast on paper can still feel sluggish in a chat window.
Making self-hosted AI faster and more resource-efficient is therefore less about buying a bigger GPU and more about not wasting the one you already have. Speculative decoding is the single most effective lever for that — and NVIDIA just released a new open building block for it.
Speculative decoding in plain terms
The idea is draft-and-verify. A small, fast drafter model guesses several future tokens. The large target model then checks that entire guess in a single forward pass, instead of generating the tokens one by one. Tokens the target agrees with are kept; the first one it disagrees with is corrected, and drafting resumes from there.
The crucial property: the output does not change. With greedy decoding it is token-for-token identical to what the target model would have produced on its own, and with sampling the standard accept/reject rule preserves the target's output distribution. You trade a little extra compute (running the drafter, plus verifying tokens you might throw away) for far fewer sequential steps. On decode-bound hardware, that is an excellent trade.
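The draft-and-verify loop can be sketched in a few lines. This is a minimal illustration for greedy decoding only, not how vLLM or SGLang implement it: the "models" are hypothetical toy functions (`toy_target`, `toy_drafter`) that map a token prefix to the next token, and the verification step is simulated with a Python loop where a real engine would use one batched forward pass.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to the next token.
NextToken = Callable[[List[int]], int]


def autoregressive(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative(target: NextToken, drafter: NextToken, prompt: List[int],
                n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy draft-and-verify. Returns (new tokens, verification rounds)."""
    seq = list(prompt)
    rounds = 0
    while len(seq) - len(prompt) < n_new:
        # 1. The drafter proposes k tokens, building on its own guesses.
        draft: List[int] = []
        for _ in range(k):
            draft.append(drafter(seq + draft))

        # 2. The target checks every drafted position. A real engine scores
        #    all positions in one forward pass; this loop only simulates that.
        rounds += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(seq + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # correct the first mismatch, stop
                break
        else:
            # Every draft was accepted: the same pass yields one bonus token.
            accepted.append(target(seq + draft))
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n_new], rounds


# Hypothetical toy models: the drafter agrees with the target except at
# prefix lengths divisible by 5.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] * 7 + len(prefix)) % 50


def toy_drafter(prefix: List[int]) -> int:
    guess = toy_target(prefix)
    return guess if len(prefix) % 5 else (guess + 1) % 50
```

Calling `speculative(toy_target, toy_drafter, [1, 2, 3], 20)` returns the same tokens as `autoregressive(toy_target, [1, 2, 3], 20)`, but in fewer target rounds, because every kept token is by construction the one the target would have chosen.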
dFlash — what is actually new
Most speculative decoders draft one token at a time, autoregressively. dFlash, an open block-diffusion model from UC San Diego, drafts an entire block of candidate tokens in a single pass. Three ideas make it work well:
- Block-diffusion drafting — predicts multiple future tokens in parallel rather than sequentially.
- Target hidden-state conditioning — the drafter is fed context features extracted from the target model, so its guesses stay aligned.
- KV injection — target context is injected into the drafter's projections across layers, which keeps the acceptance rate high.
A higher acceptance rate means more drafted tokens survive verification, which means fewer rounds, which means more speed — at unchanged quality.
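To see why acceptance rate matters so much, here is a back-of-the-envelope model. It assumes each drafted token is accepted independently with the same probability `alpha`, which is a simplification and not something measured for dFlash. Under that assumption, a round that drafts `k` tokens yields (1 - alpha^(k+1)) / (1 - alpha) tokens on average, counting the correction or bonus token the target always contributes.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens produced per verification round.

    alpha: assumed per-token acceptance probability (independent, identical).
    k:     number of tokens drafted per round.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
```

This counts tokens per round, not end-to-end speedup: the cost of running the drafter is ignored, so real gains are smaller than the ratio suggests.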
The numbers
NVIDIA benchmarked dFlash on Blackwell against plain autoregressive decoding.
The headline is up to 15× higher throughput on gpt-oss-120b at 500–600 tokens/sec per user. On other stacks: 5.8× on Gemma (vLLM, Math500) and 5.1× on Qwen3 (SGLang, Math500), while Llama 3.1 8B interactivity nearly doubled at batch size 1. Against the previous best speculative method, EAGLE-3, dFlash still delivers about 1.5× more throughput and roughly 2.3× better interactivity on average.
This is not a research-only artifact: dFlash ships in vLLM (via the Speculators library), SGLang, and TensorRT-LLM, with around 20 ready checkpoints on Hugging Face covering the Qwen, Llama, Gemma, and gpt-oss families.
Does it apply to our setup?
Yes, with a caveat. NVIDIA's figures come from datacenter Blackwell (DGX B200/B300) and Hopper cards. Our standard inference card is the RTX 5090, which is also Blackwell-class, so the architecture lines up, and speculative decoding itself runs on essentially any modern GPU through vLLM or SGLang.
What you should not expect is the 15× headline on a single consumer card — those numbers assume large batch sizes and datacenter memory bandwidth. On a 5090 the realistic win is a meaningful latency drop for interactive, single-user or small-batch workloads. Which is to say: exactly the self-hosting case.
The wider efficiency toolkit
Speculative decoding is one lever, not the whole machine. It stacks cleanly with the techniques we already run:
- PagedAttention and continuous batching — covered in our vLLM article. This is what lets ten parallel requests cost far less than ten times one request.
- Quantization — running weights at 4-bit (see our llama.cpp and GGUF notes) shrinks the memory traffic that the decode bottleneck is made of.
- Right-sizing the model — the cheapest token is the one you never generate. A well-chosen 8B model with speculative decoding often beats a 70B model you can barely fit.
The point is to combine techniques rather than chase a single benchmark number.
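The link between quantization and the decode bottleneck can be made concrete with a rough ceiling estimate. The sketch below assumes batch size 1 and that every weight is read from memory once per generated token; it ignores the KV cache, activations, and kernel overheads, so it gives an upper bound and not a prediction. The function name and the bandwidth figure in the usage note are illustrative assumptions.

```python
def decode_ceiling_tokens_per_sec(params_billion: float,
                                  bits_per_weight: float,
                                  bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed for a memory-bound model.

    params_billion:  model size in billions of parameters.
    bits_per_weight: storage precision (16 for FP16/BF16, 4 for 4-bit).
    bandwidth_gb_s:  memory bandwidth in GB/s (take it from your card's spec).
    """
    # Billions of parameters times bytes per parameter gives gigabytes.
    weight_gb = params_billion * bits_per_weight / 8
    # Each token requires streaming all weights through memory once.
    return bandwidth_gb_s / weight_gb
```

With a hypothetical round figure of 1000 GB/s, an 8B model has a ceiling of 62.5 tokens/sec at 16-bit and 250 tokens/sec at 4-bit. Speculative decoding attacks the same ceiling from the other side, by producing several tokens per pass over the weights.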
A pragmatic recipe
For a mid-sized business running its own inference, the order of operations is roughly:
- Pick the smallest model that actually does the job.
- Serve it with vLLM or SGLang — you get PagedAttention and continuous batching for free.
- Apply 4-bit quantization if VRAM is tight.
- Turn on speculative decoding — a dFlash checkpoint where one exists for your model — and measure tokens/sec before and after.
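For the last step, a small harness is enough to compare before and after. This is a sketch and not tied to any serving API: `generate` is a hypothetical callable you would write around your own vLLM or SGLang endpoint, and it is assumed to return the generated token IDs for one prompt. The `clock` parameter exists only so the function can be tested without real timing.

```python
import time
from typing import Callable, List, Sequence


def tokens_per_sec(generate: Callable[[str], Sequence[int]],
                   prompts: List[str],
                   clock: Callable[[], float] = time.perf_counter) -> float:
    """Generated tokens per second across a list of prompts, run sequentially."""
    start = clock()
    n_tokens = sum(len(generate(p)) for p in prompts)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return n_tokens / elapsed


def speedup(before: float, after: float) -> float:
    """Ratio of throughput with speculative decoding to throughput without."""
    return after / before
```

Run it once against the server without speculative decoding and once with it, using the same prompts and the same sampling settings, and compare the two numbers with `speedup`. Prompts that resemble your real traffic matter more than prompt count, since acceptance rates vary by task.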
Conclusion
Faster self-hosted AI is mostly an efficiency story, not a hardware story. The decode bottleneck wastes the GPU you already own; speculative decoding reclaims it without compromising output quality, and open drafters like dFlash make the technique easier to adopt across vLLM, SGLang, and TensorRT-LLM. For anyone serving LLMs on their own hardware, it is one of the highest-leverage settings you can turn on this year.