RAG Environments Compared: Vector Databases, Chunking, Embeddings

An LLM answers questions only as well as its context allows. RAG (Retrieval-Augmented Generation) supplies that context: it retrieves relevant documents and hands them to the model before it formulates an answer. The technique is powerful, but full of pitfalls.

The RAG Pipeline

Load documents: PDF, Markdown, HTML, code
Chunking: split documents into meaningful sections
Embedding: convert text sections into vectors
Store: persist vectors in a specialized database
Query: when a user asks a question, find similar vectors and pass the matching chunks to the LLM as context

Each step influences whether the answer is correct or hallucinated.

RAG: query → embedding → relevant hits from the vector DB → LLM → answer with sources
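
The pipeline steps can be sketched end to end in a few lines. This is a toy illustration, not a production setup: the `embed` function is a bag-of-words stand-in for a real embedding model, the in-memory list stands in for the vector database, and all function names and sample texts are assumptions made for this example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Load and chunk: hypothetical sample chunks, already split.
chunks = [
    "Qdrant stores vectors and payload metadata.",
    "Chunk overlap keeps sentences intact across boundaries.",
    "Invoices are archived for ten years.",
]

# Embed and store: keep each chunk next to its vector in an in-memory list.
store = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(question, store, k=2):
    """Query: rank stored chunks by similarity to the question."""
    q_vec = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question, hits):
    """Assemble the retrieved chunks as numbered context for the LLM."""
    context = "\n".join(f"[{i + 1}] {hit}" for i, hit in enumerate(hits))
    return f"Answer using only these sources:\n{context}\n\nQuestion: {question}"

question = "How does Qdrant store vectors?"
hits = retrieve(question, store, k=1)
print(build_prompt(question, hits))
```

Swapping `embed` for a real model and `store` for a vector database changes the quality of the hits, but not the shape of the flow.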

Vector Database: Why Qdrant?

Qdrant is written in Rust, speaks gRPC and REST, and runs as a single binary. It needs no distributed cluster for medium-sized data sets and can deliver sub-millisecond query times, depending on hardware and index configuration.

The filter API is the underrated advantage: you can filter by metadata in addition to vector similarity ranking — for example, "only documents from 2025" or "only from customer X." This makes retrieval more precise than pure vector search.
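
As a rough sketch of what such a filtered query can look like, the snippet below builds the JSON body for Qdrant's REST search endpoint (`POST /collections/{collection_name}/points/search`). The payload field names `year` and `customer`, the helper name `build_search_body`, and the example vector are assumptions for illustration. Newer Qdrant versions also offer a more general query endpoint, so check the API reference for the version you run.

```python
import json

def build_search_body(query_vector, year=None, customer=None, limit=5):
    """Build a search request body that combines vector ranking with payload filters."""
    must = []
    if year is not None:
        # Exact match on a hypothetical integer payload field "year".
        must.append({"key": "year", "match": {"value": year}})
    if customer is not None:
        # Exact match on a hypothetical keyword payload field "customer".
        must.append({"key": "customer", "match": {"value": customer}})

    body = {"vector": list(query_vector), "limit": limit, "with_payload": True}
    if must:
        # All conditions listed under "must" have to hold (logical AND).
        body["filter"] = {"must": must}
    return body

body = build_search_body([0.12, 0.34, 0.56, 0.78], year=2025, customer="X")
print(json.dumps(body, indent=2))
```

Only points whose payload satisfies every condition are then ranked by vector similarity.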

Chunking: The Underestimated Lever

Chunks that are too large dilute the context; chunks that are too small tear sentences apart. The right size depends on the embedding model and its maximum input length: typically 512 to 1024 tokens, with overlap between neighboring chunks.

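# Connect to a local Qdrant instance, where the embedded chunks will be stored.
# 6333 is Qdrant's default REST port.
from qdrant_client import QdrantClient
client = QdrantClient(host="localhost", port=6333)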

Overlap matters: a sentence that is cut off at a chunk boundary loses its context. An overlap of 10–20% of the chunk size is the pragmatic standard.
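
A minimal sketch of a sliding-window chunker with overlap follows. It assumes whitespace-separated words as a stand-in for model tokens (a real pipeline would count tokens with the embedding model's own tokenizer), and the function name and default values are chosen for illustration only.

```python
def chunk_tokens(tokens, chunk_size=512, overlap_ratio=0.15):
    """Split a token list into fixed-size windows that share a fraction of tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)  # tokens shared by neighboring chunks
    step = chunk_size - overlap                # distance between chunk start positions

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

# Whitespace split as a rough approximation of tokenization.
words = ("word " * 25).split()
pieces = chunk_tokens(words, chunk_size=10, overlap_ratio=0.2)
print([len(p) for p in pieces])
```

With a chunk size of 10 and 20% overlap, each chunk starts 8 tokens after the previous one, so the last 2 tokens of one chunk reappear at the start of the next.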

Embedding Models

For texts that mix German with other languages, multilingual models such as intfloat/multilingual-e5-large deliver the most stable results. OpenAI embeddings are good, but for sensitive data they're off-limits: using them means sending your text to an external API, and the resulting embeddings themselves contain semantic information that should not leave the premises.
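
One practical detail when using the E5 family: the model card for intfloat/multilingual-e5-large asks for a `query: ` or `passage: ` prefix on every input text. The helpers below are a minimal sketch with hypothetical function names and sample texts. The embedding call itself is left out, since it depends on which library serves the model.

```python
def prepare_passages(chunks):
    """Prefix document chunks the way E5 models expect texts that get indexed."""
    return [f"passage: {chunk.strip()}" for chunk in chunks]

def prepare_query(question):
    """Prefix a user question the way E5 models expect search queries."""
    return f"query: {question.strip()}"

passages = prepare_passages([
    "Die Rechnung wurde am 3. März gestellt. ",
    "Payment is due within 30 days.",
])
query = prepare_query("Wann wurde die Rechnung gestellt?")
print(passages)
print(query)
```

The prefixed strings, not the raw chunks, are what gets passed to the embedding model, both at indexing time and at query time.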

Reality: RAG Is Not a Push-Button Solution

RAG works — but it requires trial and error. Which chunk size, which model, which retrieval strategy (dense, sparse, hybrid) depends on the document type. The first 80% accuracy is easy; the last 20% takes work.
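
Trial and error is easier to manage with a fixed measurement. A minimal sketch, assuming a small hand-labeled set of questions, each paired with the ID of the chunk that should be retrieved: the hit rate at k below can be recomputed for every chunk size, model, or retrieval strategy under test. The function name, the `retrieve(question, k)` signature, and the sample data are assumptions for illustration.

```python
def hit_rate_at_k(retrieve, labeled_questions, k=3):
    """Fraction of questions whose expected chunk ID appears in the top-k results."""
    if not labeled_questions:
        raise ValueError("labeled_questions must not be empty")
    hits = 0
    for question, expected_id in labeled_questions:
        if expected_id in retrieve(question, k):
            hits += 1
    return hits / len(labeled_questions)

# Hypothetical retriever: canned rankings instead of a real vector search.
canned_rankings = {
    "When was the invoice issued?": ["chunk-7", "chunk-2", "chunk-9"],
    "Who signed the contract?": ["chunk-4", "chunk-1", "chunk-3"],
}

def fake_retrieve(question, k):
    return canned_rankings[question][:k]

labeled = [
    ("When was the invoice issued?", "chunk-2"),  # expected chunk is ranked second
    ("Who signed the contract?", "chunk-8"),      # expected chunk is never retrieved
]

print(hit_rate_at_k(fake_retrieve, labeled, k=3))
print(hit_rate_at_k(fake_retrieve, labeled, k=1))
```

Running the same labeled set against two configurations makes the comparison concrete instead of anecdotal.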

Conclusion

RAG is not a product but an architecture. Qdrant makes the vector side simple. The rest — chunking, prompt design, evaluation — remains engineering work.

FAQ

Do our documents have to be uploaded to OpenAI for RAG?

No. For texts that mix German with other languages, a multilingual model such as multilingual-e5-large delivers stable results and can run locally. OpenAI embeddings are off-limits for sensitive data, because embeddings contain semantic information and should not leave the premises. The entire pipeline can run in your own network with Qdrant and self-hosted models, keeping the data in-house.

Is RAG a finished product you just switch on?

No, RAG is an architecture, not a product. The first 80 percent accuracy is easy; the last 20 percent takes work. Which chunk size, which embedding model, and which retrieval strategy — dense, sparse, or hybrid — fits depends on the document type and requires trial and error. Chunking, prompt design, and evaluation remain engineering work.

Why Qdrant instead of a large database cluster?

Qdrant is written in Rust, speaks gRPC and REST, and runs as a single binary. For medium-sized data sets it needs no distributed cluster and can deliver sub-millisecond query times, depending on hardware and index configuration. The underrated advantage is the filter API: alongside vector ranking you can filter by metadata, such as only documents from 2025 or only from customer X, making retrieval more precise.