llama.cpp & GGUF: LLMs locally, efficiently, without a cluster

Not every inference job needs high throughput. Sometimes a single local request on a workstation, with no API involved, is enough. That’s exactly where llama.cpp shines.

GGUF: Quantization that works

GGUF is the file format llama.cpp uses, and the quantization types it carries (Q4_K_M, Q5_K_M, and so on) dominate the local-inference scene. A 70B model takes about 140 GB in FP16; quantized to Q4_K_M it shrinks to roughly 40 GB, which runs on an RTX 5090 (32 GB) with part of the model offloaded to the CPU.
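The size figures above follow from simple arithmetic: parameter count times bits per weight, divided by eight. The sketch below is a rough estimate, not how llama.cpp computes anything. The bits-per-weight values are approximate averages assumed for illustration (K-quants mix several precisions across tensors), and `estimate_size_gb` is a hypothetical helper name.

```python
# Rough GGUF weight-size estimate: parameters x bits per weight / 8.
# The bits-per-weight values are approximate averages (assumptions),
# because K-quants store different tensors at different precisions.
APPROX_BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
}


def estimate_size_gb(n_params: float, quant: str) -> float:
    """Return the approximate weight size in GB (10^9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 1e9


if __name__ == "__main__":
    # Print an estimate for a 70B-parameter model at each precision.
    for quant in APPROX_BITS_PER_WEIGHT:
        print(f"{quant:8s} ~{estimate_size_gb(70e9, quant):.0f} GB")
```

For a 70B model this gives 140 GB at FP16 and a Q4_K_M estimate in the low 40s, in line with the rounded 40 GB figure. Actual file sizes vary by model architecture, and the KV cache needs additional memory on top of the weights.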

Quantization isn’t magic, but it holds up far better than the numbers suggest. Q4 loses barely any output quality; at Q2 the loss is noticeable. Our default is Q5_K_M for serious work and Q4_K_M when VRAM is tight.

CPU/GPU offload

llama.cpp can offload individual layers to the GPU and compute the rest on the CPU. This isn’t a high-performance setup, but it works, and it means you can test large models before buying hardware.

./llama-cli -m model-q4_k_m.gguf -ngl 40 -c 8192

-ngl 40 puts 40 layers on the GPU and leaves the rest on the CPU; -c 8192 sets the context size to 8192 tokens. That’s the bridge between “doesn’t fit” and “fits, if I want it to.”
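A starting value for -ngl can be estimated before the first run. The sketch below assumes that all layers are the same size and that a fixed VRAM reserve covers the KV cache and buffers; both are simplifications. `layers_that_fit`, `build_command`, and the 4 GB default reserve are hypothetical choices for illustration, and the 80-layer count in the usage note is an assumption about the model. Only the -m, -ngl, and -c flags come from the command above.

```python
def layers_that_fit(model_gb: float, n_layers: int,
                    vram_gb: float, reserve_gb: float = 4.0) -> int:
    """Estimate a value for -ngl, assuming every layer is the same size.

    reserve_gb is an assumed allowance for KV cache and scratch buffers.
    """
    if model_gb <= 0 or n_layers <= 0:
        raise ValueError("model_gb and n_layers must be positive")
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    # Never report more layers than the model actually has.
    return min(n_layers, int(usable_gb // per_layer_gb))


def build_command(model_path: str, ngl: int, ctx: int = 8192) -> list[str]:
    """Build the llama-cli argument list using the flags shown above."""
    return ["./llama-cli", "-m", model_path, "-ngl", str(ngl), "-c", str(ctx)]


if __name__ == "__main__":
    # Example: 40 GB model, 80 layers (assumed), 32 GB of VRAM.
    ngl = layers_that_fit(model_gb=40.0, n_layers=80, vram_gb=32.0)
    print(" ".join(build_command("model-q4_k_m.gguf", ngl)))
```

Treat the result as a first guess: if the model fails to load or runs out of memory, lower the value, and raise it if VRAM is left unused.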

The counterpoint to vLLM

vLLM (see previous article) optimizes for throughput: many parallel requests, server operation. llama.cpp optimizes for flexibility: single requests, various quantizations, CPU fallback. The result is the same in both cases: an LLM responds. The use case decides.

Rule of thumb: vLLM as the inference server for tools and teams; llama.cpp for development, local testing, and the one special case that doesn’t fit a server architecture.

Conclusion

llama.cpp is the Swiss Army knife of local LLM inference. It isn’t a throughput-oriented server, but it makes almost everything else possible, and it has the charm of running on a workstation that’s been sitting under the desk for ages.

FAQ

How much does GGUF quantization reduce model quality?

Quantization holds up far better than the numbers suggest. Q4 loses barely any output quality; at Q2 the loss is noticeable. A 70B model at 140 GB in FP16 shrinks to roughly 40 GB as Q4_K_M, which runs on an RTX 5090 with partial CPU offload. For serious work our default is Q5_K_M, with Q4_K_M when VRAM is tight.

Can I test large models before buying the hardware for them?

Yes. llama.cpp can offload individual layers to the GPU and compute the rest on the CPU. This is not a high-performance setup, but it works, and it is the bridge between "doesn't fit" and "fits, if I want it to." That way you can test large models on the workstation already under your desk, without procuring expensive hardware first.

When do I use llama.cpp and when vLLM?

vLLM optimizes for throughput: many parallel requests, server operation for tools and teams. llama.cpp optimizes for flexibility: single requests, various quantizations, and CPU offload as a fallback. Both deliver the same result: an LLM responds. Rule of thumb: vLLM as an inference server, llama.cpp for development, local testing, and the special case that doesn't fit a server architecture.