RAG Architecture with Claude: Patterns That Actually Scale

2026-05-06 · 870 words · 4 min read

**DRAFT — pending editorial expansion.** This article is a working draft published as scaffolding for the NINtec content programme. The current version covers the substantive perspective in compressed form; the published version will expand each section to the 2,000+ word depth the topic warrants. Editorial review is required before promotion.

RAG is the most common architectural pattern in enterprise Claude deployments. It is also the pattern that most often crumbles between pilot and production. The pattern is deceptively simple — embed documents, retrieve relevant chunks, give them to Claude as context — but the production-engineering surface around it is large. This piece covers the architectural decisions that determine whether your RAG system survives the long tail.

Chunking strategy is the throttle

Most RAG demos chunk by token count — a fixed window of N tokens, sliding across the document. This works in demos and fails in production because document semantics matter. Legal contracts chunk by clause, not by token count. Medical guidelines chunk by recommendation. Engineering documentation chunks by section. The right chunking strategy is structure-aware, and the structure varies per document type.

We start with a structure-aware chunker — sections, tables, code blocks, citations all respected — and tune chunk size against retrieval evals. Token-count-only chunkers are starting points we always move past.
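A minimal sketch of what structure-aware chunking looks like for markdown-style documents; the heading regex, the token-to-word ratio, and the default budget are illustrative assumptions, not our production chunker.

```python
import re

# Illustrative sketch: split on markdown headings first, then enforce a
# token budget inside each section.
HEADING = re.compile(r"^#{1,6}\s", re.MULTILINE)

def chunk_by_structure(doc: str, max_tokens: int = 512) -> list[str]:
    """Chunk at section boundaries so no chunk straddles two sections."""
    starts = [m.start() for m in HEADING.finditer(doc)]
    if not starts or starts[0] != 0:
        starts = [0] + starts  # keep any preamble before the first heading
    sections = [doc[a:b] for a, b in zip(starts, starts[1:] + [len(doc)])]
    chunks = []
    for section in sections:
        words = section.split()
        # Crude token proxy: ~0.75 words per token. Oversized sections get
        # a sliding window *within* the section, so structure is respected.
        budget = int(max_tokens * 0.75)
        for i in range(0, len(words), budget):
            chunks.append(" ".join(words[i : i + budget]))
    return chunks
```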

Embedding selection is workload-specific

Benchmark numbers do not predict performance on your specific corpus. We evaluate embedding models against the actual corpus in Discovery — Cohere embed-v3, OpenAI text-embedding-3, Voyage, BGE — and pick on retrieval metrics, not on published benchmarks. The gap between the best and worst embedding for a given corpus is often 15–25 points of retrieval recall, which compounds into materially better answer quality.
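A sketch of how such a bake-off can be scored, assuming one gold chunk per query and a provider-agnostic `embed` callable (our assumption, standing in for each vendor's client):

```python
import numpy as np

def recall_at_k(query_vecs, doc_vecs, gold, k=5):
    """Fraction of queries whose gold chunk appears in the top-k results.
    Assumes one gold chunk per query; gold[i] is its index for query i."""
    # Normalise so the dot product equals cosine similarity.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    hits = 0
    for i, scores in enumerate(q @ d.T):
        hits += gold[i] in np.argsort(-scores)[:k]
    return hits / len(gold)

# Bake-off loop. `models` maps a name to an embed callable wrapping each
# provider's client (Cohere, OpenAI, Voyage, BGE) -- an assumption here.
# for name, embed in models.items():
#     score = recall_at_k(embed(queries), embed(corpus), gold_indices)
#     print(f"{name}: recall@5 = {score:.3f}")
```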

Hybrid retrieval rescues the cases dense search misses

Pure dense (vector) retrieval misses queries where keyword matching matters — product codes, error strings, named entities. Pure keyword retrieval misses semantic-paraphrase queries. Hybrid retrieval combines both: dense vector search plus BM25 keyword search, with a re-ranking step (cross-encoder model) on the merged candidate set. This is now the production standard for serious RAG systems; we ship it by default.
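A sketch of the pipeline using the rank_bm25 and sentence-transformers libraries; `dense_search` is a placeholder for your vector store's top-k query, and 60 is the conventional reciprocal-rank-fusion constant:

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

# Loaded once; any cross-encoder re-ranker works, this checkpoint is just
# a common public default.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def hybrid_retrieve(query, docs, dense_search, k=20, final_k=5):
    """BM25 + dense candidates merged with reciprocal rank fusion, then
    re-ranked by a cross-encoder. dense_search(query, k) stands in for
    your vector store's top-k query and must return doc indices."""
    bm25 = BM25Okapi([d.split() for d in docs])
    scores = bm25.get_scores(query.split())
    kw_ranked = sorted(range(len(docs)), key=lambda i: -scores[i])[:k]

    # Reciprocal rank fusion: sum 1/(60 + rank) across both retrievers.
    fused = {}
    for ranking in (kw_ranked, dense_search(query, k)):
        for rank, idx in enumerate(ranking):
            fused[idx] = fused.get(idx, 0.0) + 1.0 / (60 + rank)
    candidates = sorted(fused, key=fused.get, reverse=True)[:k]

    # Cross-encoder scores each (query, doc) pair on the merged set.
    ce_scores = reranker.predict([(query, docs[i]) for i in candidates])
    reranked = sorted(zip(candidates, ce_scores), key=lambda p: -p[1])
    return [docs[i] for i, _ in reranked[:final_k]]
```

The rank-fusion merge is the design choice worth noting: neither retriever's raw scores need calibrating against the other, because only ranks enter the fused score.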

Citation discipline as defence against hallucination

Claude follows citation instructions reliably when given the right prompt patterns. Production RAG prompts instruct Claude to cite the chunk number after each claim, refuse to answer when the context does not contain the answer, and explicitly mark uncertainty. Hallucination-detection layers compare answer claims to retrieved context and flag ungrounded outputs. Together these reduce — not eliminate — hallucination risk to acceptable levels for regulated workloads.
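For illustration, a prompt skeleton that wires those three rules together; this is a sketch of the pattern, not our production prompt:

```python
# Illustrative prompt skeleton, not NINtec's production prompt.
SYSTEM = """Answer using ONLY the numbered context chunks below.
Rules:
1. After every factual claim, cite its source as [chunk N].
2. If the chunks do not contain the answer, reply exactly:
   "The provided context does not answer this question."
3. Prefix any claim you are not certain of with "Uncertain:"."""

def build_prompt(question: str, chunks: list[str]) -> str:
    context = "\n\n".join(f"[chunk {i + 1}]\n{c}" for i, c in enumerate(chunks))
    return f"{SYSTEM}\n\n{context}\n\nQuestion: {question}"
```

A downstream hallucination check can then parse the [chunk N] markers and verify each cited claim against the chunk it names.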

Eval as a CI signal

Production RAG cannot be deployed without continuous evaluation. The eval set is a suite of golden questions with expected citations; the metrics are retrieval recall@5, retrieval precision@5, and end-to-end answer quality scored by a judge LLM. Drift alerts fire when metrics fall below threshold, and production prompt and chunking changes block on eval-bar parity. Without eval discipline, the system regresses silently between releases.
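A sketch of what the gate can look like in CI; the thresholds here are hypothetical, since eval bars are negotiated per engagement:

```python
# Illustrative CI gate; the thresholds are hypothetical eval bars, not
# recommendations.
THRESHOLDS = {"recall_at_5": 0.85, "precision_at_5": 0.70, "judge_score": 4.0}

def eval_gate(metrics: dict[str, float]) -> None:
    """Fail the build if any metric falls below its eval bar."""
    failures = [
        f"{name}: {metrics[name]:.3f} < {bar:.3f}"
        for name, bar in THRESHOLDS.items()
        if metrics[name] < bar
    ]
    if failures:
        raise SystemExit("Eval gate failed:\n" + "\n".join(failures))

# In CI: run the golden-question suite, compute metrics, then
# eval_gate(metrics) before promoting the release.
```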

Corpus update pipelines

Documents change. Production RAG includes event-driven re-indexing on document create/update/delete, deletion-aware vector store hygiene (orphan vectors deleted), and corpus versioning so changes can be rolled back. Without this, the index drifts from the source documents over time and answer quality degrades silently.
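A sketch of the re-indexing handler under assumed event and vector-store interfaces, reusing the structure-aware chunker sketched earlier:

```python
# Illustrative handler; the event shape and the vector-store API
# (store.delete, store.upsert) are assumptions, and chunk_by_structure
# is the chunker sketched earlier.
def on_document_event(event: dict, store, embed) -> None:
    """Keep the vector store in lockstep with the source system."""
    doc_id = event["doc_id"]
    # Delete-first keeps the store free of orphan vectors even when a
    # re-chunked document yields fewer chunks than the previous version.
    store.delete(filter={"doc_id": doc_id})
    if event["type"] in ("created", "updated"):
        chunks = chunk_by_structure(event["body"])
        store.upsert(
            vectors=embed(chunks),
            metadata=[
                {"doc_id": doc_id, "version": event["version"], "chunk": i}
                for i in range(len(chunks))
            ],
        )
```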

When RAG is not the answer

If your corpus fits in Claude's 200K context window and the cost economics favour context-stuffing, RAG might be over-engineering. Anthropic's prompt caching makes context-stuffing dramatically cheaper for repeated context. The Discovery phase identifies whether RAG is necessary or whether a simpler architecture works.
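For illustration, here is what context-stuffing with prompt caching looks like through the Anthropic Python SDK; the corpus file, question, and model ID are placeholders:

```python
from pathlib import Path

import anthropic

client = anthropic.Anthropic()
corpus_text = Path("corpus.txt").read_text()  # placeholder corpus file

# The stable corpus block is marked with cache_control, so repeated
# questions reuse the cached prefix instead of re-paying for it.
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # use whichever model you deploy
    max_tokens=1024,
    system=[
        {"type": "text", "text": "Answer from the reference documents."},
        {
            "type": "text",
            "text": corpus_text,  # must fit within the context window
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "What does the policy say about refunds?"}],
)
print(response.content[0].text)
```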

How NINtec engages on RAG

Typical RAG engagement runs 8–14 weeks: 1–2 weeks Discovery (corpus walk-through, eval set construction, retrieval-quality target negotiation), 6–10 weeks Build (iterative with weekly retrieval-quality readouts), 1–2 weeks Hardening (long-tail adversarial testing, document-update churn drills). The eval-set quality is the throttle on subsequent iteration speed; investing in good evals early pays back through every later optimisation cycle.
