One RTX 5090 Instead of Four Old GPUs: Consolidating Vision Inference

We run our own AI inference — and run into the same questions as anyone who does. For the vision stage of our extraction pipeline, the question was: more cards, or the right card? The answer was instructive.

The Starting Point

Four older GPUs carried the vision part — processing scanned PDFs and images — and still delivered only meager token rates.
The first reflex: more hardware. The wrong reflex, as it turned out.

The Real Root Cause

The problem was not raw compute but the software path: on the old cards, the driver stack forced an inefficient execution mode (enforce-eager) that throttled them to around 16 tokens/s.
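On the RTX 5090, CUDA graphs work again. Instead of launching every kernel individually, as eager mode does, the runtime replays a captured sequence of kernels with far less launch overhead, which multiplies throughput without adding cards.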
A single card also avoids the tensor-parallel and PCIe overhead that four cards create among themselves.
FP8 instead of aggressive 4-bit quantization keeps extracted digits stable, which is decisive when numbers must be read correctly from documents.
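To make the difference between the two setups concrete, here is a minimal sketch of how they might look as launch commands. It assumes the serving stack is vLLM, which the term enforce-eager suggests but which this post does not state. The model name is a hypothetical placeholder, and the old setup's quantization is left out because it is not specified here.

```python
# Sketch only: assumes the serving stack is vLLM, which the post does not name.
# MODEL is a hypothetical placeholder, not the model actually deployed.
MODEL = "example-org/vision-model"


def serve_command(model, tensor_parallel_size=1, enforce_eager=False, quantization=None):
    """Build a `vllm serve` argument list for one deployment variant."""
    cmd = ["vllm", "serve", model, "--tensor-parallel-size", str(tensor_parallel_size)]
    if enforce_eager:
        # Eager mode skips CUDA graph capture, so every decode step
        # pays the launch overhead of each kernel individually.
        cmd.append("--enforce-eager")
    if quantization is not None:
        cmd += ["--quantization", quantization]
    return cmd


# Old setup as described: four cards in tensor parallel, eager mode forced.
old_setup = serve_command(MODEL, tensor_parallel_size=4, enforce_eager=True)

# Consolidated setup: one card, CUDA graphs left enabled, FP8 quantization.
new_setup = serve_command(MODEL, tensor_parallel_size=1, quantization="fp8")

print(" ".join(old_setup))
print(" ".join(new_setup))
```

The point of writing both variants side by side is that the consolidated one is also the simpler one: no tensor-parallel group to coordinate and no eager-mode workaround to carry along.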

Our Take

The lesson is old but keeps repeating: measure the cause before buying hardware. Here, one current card beats four older ones, and complexity, power draw, and maintenance all drop along the way. When you run your own inference, consolidating often scales better than adding.
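Measuring the cause can start very small. The following is a minimal sketch of a throughput probe, not the harness we used: `generate` is a hypothetical callable that runs one request and returns the number of tokens produced, and `fake_generate` is a stand-in so the example runs without a GPU.

```python
import time


def tokens_per_second(generate, prompts, clock=time.perf_counter):
    """Return generated tokens per wall-clock second over all prompts.

    `generate` is a hypothetical callable: it takes one prompt and
    returns the number of tokens it produced for that prompt.
    """
    start = clock()
    total_tokens = sum(generate(p) for p in prompts)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return total_tokens / elapsed


def fake_generate(prompt):
    # Stand-in for a real inference call: sleeps briefly and
    # pretends to have emitted 8 tokens.
    time.sleep(0.01)
    return 8


if __name__ == "__main__":
    rate = tokens_per_second(fake_generate, ["page-1", "page-2", "page-3"])
    print(f"{rate:.1f} tokens/s")
```

Running the same probe against the same card with only one setting changed (for example, eager mode on versus off) is what separates a software-path problem from a hardware limit.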