Making Self-Hosted AI Faster: Speculative Decoding and dFlash

Self-hosted LLMs have a quiet performance problem, and it is not the one most people expect. Raw GPU compute is rarely the bottleneck. During text generation the card spends most of its time waiting on memory: for every step it streams the model weights and the key-value cache, then emits a single token, strictly in sequence. The compute units sit mostly idle. This is the decode bottleneck, and it is why a card that looks fast on paper can still feel sluggish in a chat window.

Making self-hosted AI faster and more resource-efficient is therefore less about buying a bigger GPU and more about not wasting the one you already have. Speculative decoding is the single most effective lever for that — and NVIDIA just released a new open building block for it.

NVIDIA® and the NVIDIA logo are registered trademarks of NVIDIA Corporation.

Speculative decoding in plain terms

The idea is draft-and-verify. A small, fast drafter model guesses several future tokens. The large target model then checks that entire guess in a single forward pass, instead of generating the tokens one by one. Tokens the target agrees with are kept; the first one it disagrees with is corrected, and drafting resumes from there.

The crucial property: the output is equivalent to what the target model would have produced on its own. With greedy decoding the tokens match one for one (up to floating-point effects); with sampling, the acceptance rule preserves the target model's output distribution. You trade a little extra compute (running the drafter, plus verifying tokens you might throw away) for far fewer sequential steps. On decode-bound hardware, that is an excellent trade.

Speculative decoding: the drafter proposes a whole block of tokens and the target model verifies it in a single forward pass; accepted tokens are kept, rejected ones are redrawn.
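
The draft-and-verify loop can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not how vLLM or SGLang implement it: decoding is greedy, both "models" are hypothetical toy functions (toy_target, toy_drafter), and the single batched verification pass is emulated with a plain loop.

```python
from typing import Callable, List, Tuple

# A "model" is any function mapping a token sequence to the next token
# (greedy decoding). Both models below are toy stand-ins, not real LLMs.
Model = Callable[[List[int]], int]


def toy_target(seq: List[int]) -> int:
    """Hypothetical target model: a deterministic next-token rule."""
    return (seq[-1] * 7 + 3) % 50


def toy_drafter(seq: List[int]) -> int:
    """Hypothetical drafter: agrees with the target except at every 5th position."""
    guess = toy_target(seq)
    return guess if len(seq) % 5 else (guess + 1) % 50


def speculative_step(target: Model, drafter: Model,
                     context: List[int], k: int) -> List[int]:
    """Run one draft-and-verify round and return the tokens to append."""
    # 1. The drafter guesses k future tokens.
    draft: List[int] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    # 2. The target checks every drafted position. A real engine does this
    #    in one batched forward pass; this loop only emulates the outcome.
    accepted: List[int] = []
    for i in range(k):
        expected = target(context + draft[:i])
        if expected != draft[i]:
            # First disagreement: keep the target's token, drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(draft[i])

    # 3. Every draft was accepted: the same pass also yields one bonus token.
    accepted.append(target(context + draft))
    return accepted


def generate(target: Model, drafter: Model, prompt: List[int],
             n_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate n_tokens speculatively; also return the number of rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, drafter, out, k))
        rounds += 1
    return out[: len(prompt) + n_tokens], rounds


def greedy(target: Model, prompt: List[int], n_tokens: int) -> List[int]:
    """Baseline: plain one-token-at-a-time decoding with the target only."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out


if __name__ == "__main__":
    tokens, rounds = generate(toy_target, toy_drafter, [1, 2, 3], n_tokens=20)
    print(tokens == greedy(toy_target, [1, 2, 3], 20), rounds)
```

On these toy models the speculative run reproduces plain greedy decoding token for token while needing far fewer target rounds than generated tokens, which is the whole point of the technique.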

dFlash — what is actually new

Most speculative decoders draft one token at a time, autoregressively. dFlash, an open block-diffusion model from UC San Diego, drafts an entire block of candidate tokens in a single pass. Three ideas make it work well:

Block-diffusion drafting — predicts multiple future tokens in parallel rather than sequentially.
Target hidden-state conditioning — the drafter is fed context features extracted from the target model, so its guesses stay aligned.
KV injection — target context is injected into the drafter's projections across layers, which keeps the acceptance rate high.

A higher acceptance rate means more drafted tokens survive verification, which means fewer rounds, which means more speed — at unchanged quality.
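
The link between acceptance rate and speed can be made concrete with a short calculation. The sketch below uses the standard textbook simplification that each drafted token is accepted independently with the same probability alpha; real acceptance is position-dependent, and for a block drafter this is only an approximation. The function name and parameters are illustrative.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens produced per verification pass.

    alpha: assumed per-token acceptance probability (independent, identical).
    k:     number of drafted tokens per round.
    A round always yields at least one token: either the target's correction
    at the first rejection, or a bonus token when all k drafts are accepted.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.6, 0.8, 0.9):
        print(alpha, round(expected_tokens_per_round(alpha, k=8), 2))
```

Under this model, with 8 drafted tokens per round, raising the acceptance rate from 0.6 to 0.8 lifts the expected yield from about 2.5 to about 4.3 tokens per target pass. That is why drafter quality matters more than draft length alone.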

The numbers

NVIDIA benchmarked dFlash on Blackwell against plain autoregressive decoding:

Throughput relative to autoregressive decoding on NVIDIA Blackwell; values from NVIDIA's dFlash benchmarks.

The headline is up to 15× higher throughput on gpt-oss-120b at 500–600 tokens/sec per user. On other stacks: 5.8× on Gemma (vLLM, Math500) and 5.1× on Qwen3 (SGLang, Math500), while Llama 3.1 8B interactivity nearly doubled at batch size 1. Against the previous best speculative method, EAGLE-3, dFlash still delivers about 1.5× more throughput and roughly 2.3× better interactivity on average.

This is not a research-only artifact: dFlash ships in vLLM (via the Speculators library), SGLang, and TensorRT-LLM, with around 20 ready checkpoints on Hugging Face covering the Qwen, Llama, Gemma, and gpt-oss families.

Does it apply to our setup?

Yes, with a caveat. NVIDIA's figures come from datacenter Blackwell (DGX B200/B300) and Hopper cards. Our standard inference card is the RTX 5090, which is also Blackwell-class, so the architecture lines up, and speculative decoding itself runs on essentially any modern GPU through vLLM or SGLang.

What you should not expect is the 15× headline on a single consumer card — those numbers assume large batch sizes and datacenter memory bandwidth. On a 5090 the realistic win is a meaningful latency drop for interactive, single-user or small-batch workloads. Which is to say: exactly the self-hosting case.

The wider efficiency toolkit

Speculative decoding is one lever, not the whole machine. It stacks cleanly with the techniques we already run:

PagedAttention and continuous batching — covered in our vLLM article. This is what lets ten parallel requests cost far less than ten times one request.
Quantization — running weights at 4-bit (see our llama.cpp and GGUF notes) shrinks the memory traffic that the decode bottleneck is made of.
Right-sizing the model — the cheapest token is the one you never generate. A well-chosen 8B model with speculative decoding often beats a 70B model you can barely fit.

The point is to combine techniques rather than chase a single benchmark number.
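
A back-of-the-envelope calculation shows why quantization and right-sizing attack the decode bottleneck directly. The sketch assumes that, at batch size 1, every generated token requires reading all model weights from VRAM once, and it ignores KV-cache traffic and kernel overhead, so the result is a ceiling and not a prediction. The 1792 GB/s bandwidth figure for the RTX 5090 is an assumption taken from the published spec sheet; substitute your own card's value.

```python
def decode_ceiling_tokens_per_s(params_billion: float,
                                bits_per_weight: float,
                                bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed for a memory-bound model.

    Assumes each token reads every weight exactly once and nothing else
    (no KV cache, no activations), so real throughput will be lower.
    """
    weight_gb = params_billion * bits_per_weight / 8  # GB of weights
    return bandwidth_gb_s / weight_gb


if __name__ == "__main__":
    bandwidth = 1792.0  # GB/s, assumed RTX 5090 memory bandwidth
    for bits in (16, 8, 4):
        ceiling = decode_ceiling_tokens_per_s(8, bits, bandwidth)
        print(f"8B model at {bits}-bit: at most {ceiling:.0f} tokens/s")
```

Halving the bytes per weight doubles the ceiling, and a smaller model raises it by the same logic. Speculative decoding then works on top of this by producing several tokens per weight read.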

A pragmatic recipe

For a mid-sized business running its own inference, the order of operations is roughly:

Pick the smallest model that actually does the job.
Serve it with vLLM or SGLang — you get PagedAttention and continuous batching for free.
Apply 4-bit quantization if VRAM is tight.
Turn on speculative decoding — a dFlash checkpoint where one exists for your model — and measure tokens/sec before and after.
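
For the final measuring step, a small script is enough. The sketch below assumes the server exposes the OpenAI-compatible /v1/completions endpoint (as vLLM and SGLang do) and reports a usage.completion_tokens field in the response; the base URL, model name, and prompt are placeholders. It times the whole request, so prefill is included, and a single run is noisy: repeat it a few times with the same prompt before and after enabling speculative decoding.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Generated tokens divided by wall-clock time for the request."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def speedup(before_tps: float, after_tps: float) -> float:
    """Ratio of throughput after a change to throughput before it."""
    if before_tps <= 0:
        raise ValueError("before_tps must be positive")
    return after_tps / before_tps


def measure(base_url: str, model: str, prompt: str,
            max_tokens: int = 256) -> float:
    """Send one completion request and return end-to-end tokens/sec."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,  # greedy, so runs are comparable
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    elapsed = time.perf_counter() - start
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)


if __name__ == "__main__":
    # Placeholder values: adjust to your own server and model.
    tps = measure("http://localhost:8000", "my-model", "Explain KV caching.")
    print(f"{tps:.1f} tokens/s")
```

Run it once against the baseline server and once with the drafter enabled, then pass both results to speedup() to get the number that matters for your workload.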

Conclusion

Faster self-hosted AI is mostly an efficiency story, not a hardware story. The decode bottleneck wastes the GPU you already own; speculative decoding reclaims it without compromising output quality, and open drafters like dFlash make the technique easier to adopt across vLLM, SGLang, and TensorRT-LLM. For anyone serving LLMs on their own hardware, it is one of the highest-leverage settings you can turn on this year.