AI & Development
2026-06-29 · by Mag. (FH) Franz Senn
One RTX 5090 Instead of Four Old GPUs: Consolidating Vision Inference
We run our own AI inference and face the same questions as anyone who does. For the vision stage of our extraction pipeline, the question was: more cards, or the right card? The answer was instructive.
The Starting Point
- Four older GPUs handled the vision stage (processing scanned PDFs and images) and still delivered only meager token rates.
- The first reflex: more hardware. The wrong reflex, as it turned out.
The Real Root Cause
- The problem was not raw power but the software path: the driver stack forced an inefficient execution mode (enforce-eager) that throttled the cards to around 16 tokens/s. In eager mode, every GPU kernel is launched individually instead of being replayed from a captured CUDA graph.
- On the RTX 5090, CUDA graphs work again, which multiplies throughput without adding cards.
- A single card also avoids the tensor-parallel and PCIe overhead that four cards incur when communicating with each other.
- FP8 instead of aggressive 4-bit quantization keeps extracted digits stable, which is decisive when numbers must be read correctly from documents.
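To make the eager-versus-graph point more tangible, here is a toy model of a single decode step. It is only a sketch: the class `DecodeStep`, its fields, and every number in it are hypothetical and chosen for illustration, not measurements from our setup. The assumption is that kernel launches are asynchronous, so a step takes as long as the slower side, either the CPU issuing work or the GPU executing it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeStep:
    """Toy model of one decode step (one generated token). All values are hypothetical."""
    kernels_per_step: int       # GPU kernels needed to produce one token
    launch_overhead_us: float   # CPU-side cost of launching a single kernel
    graph_replay_us: float      # CPU-side cost of replaying one captured graph
    gpu_compute_us: float       # time the GPU needs for the actual math


def step_time_us(step: DecodeStep, use_cuda_graph: bool) -> float:
    # Launches are asynchronous, so the step is bounded by the slower side:
    # the CPU issuing the work or the GPU executing it.
    if use_cuda_graph:
        cpu_us = step.graph_replay_us
    else:
        cpu_us = step.kernels_per_step * step.launch_overhead_us
    return max(cpu_us, step.gpu_compute_us)


def tokens_per_second(step: DecodeStep, use_cuda_graph: bool) -> float:
    return 1_000_000 / step_time_us(step, use_cuda_graph)


# Hypothetical numbers, picked only to show the effect (not measured values).
step = DecodeStep(
    kernels_per_step=2_000,
    launch_overhead_us=10.0,
    graph_replay_us=50.0,
    gpu_compute_us=8_000.0,
)

eager_tps = tokens_per_second(step, use_cuda_graph=False)
graph_tps = tokens_per_second(step, use_cuda_graph=True)
print(f"eager: {eager_tps:.0f} tokens/s, CUDA graph: {graph_tps:.0f} tokens/s")
```

In this model the eager path is limited by the CPU launching thousands of small kernels, while the graph path is limited only by GPU compute. That is the pattern in which a faster or additional GPU changes little until the launch path is fixed.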
Our Take
The lesson is old but keeps repeating: find and measure the cause before buying hardware. Here, one current card beats four older ones, and complexity, power draw, and maintenance effort all drop along the way. When you operate your own hardware, consolidating often scales better than adding.
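Measuring first can start very small. The following sketch times an arbitrary generation callable and reports tokens per second. The names `measure_tokens_per_second` and `fake_generate` are hypothetical; `fake_generate` merely stands in for a real inference call and would be replaced by whatever client your own stack provides.

```python
import time
from typing import Callable, Sequence


def measure_tokens_per_second(
    generate: Callable[[str], Sequence[int]],
    prompts: Sequence[str],
    warmup: int = 1,
) -> float:
    """Return generated tokens per wall-clock second over all prompts."""
    # Warm-up runs keep one-time costs (model load, graph capture) out of the timing.
    for prompt in prompts[:warmup]:
        generate(prompt)

    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        total_tokens += len(generate(prompt))
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def fake_generate(prompt: str) -> list[int]:
    # Hypothetical stand-in for a real inference call: returns 8 dummy token ids.
    time.sleep(0.01)
    return list(range(8))


rate = measure_tokens_per_second(fake_generate, ["page 1", "page 2", "page 3"])
print(f"{rate:.1f} tokens/s")
```

Running the same harness before and after a configuration change, on identical prompts, gives a comparable number before any purchase decision is made.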