Building Production RAG + MCP Agents: A Practical Architecture Guide
A senior engineer's field guide to designing retrieval-augmented generation systems with Model Context Protocol integration — covering chunking strategies, retrieval pipelines, tool orchestration, and the pitfalls that only show up at scale.
Most RAG tutorials end where real problems begin. They show you a happy path: embed some documents, stuff them into a prompt, get a response. Ship it. What they don't cover is what happens when your retrieval precision drops to 40% on domain-specific queries, when your agent hallucinates tool calls, or when your MCP server goes down mid-conversation and you need graceful degradation.
This post covers the architecture I've used to ship RAG-backed AI agents in production — with Model Context Protocol (MCP) for tool integration — and the concrete tradeoffs you'll face at every layer.
The Architecture at a Glance
A production RAG + MCP agent has four layers, and you need to get each one right independently before they compose well together.
┌─────────────────────────────────────────────┐
│ Orchestration Layer │
│ (Agent loop, planning, memory management) │
├─────────────────────────────────────────────┤
│ Retrieval Layer (RAG) │
│ (Query rewriting → Retrieval → Reranking) │
├─────────────────────────────────────────────┤
│ Tool Layer (MCP Servers) │
│ (Schema negotiation, invocation, results) │
├─────────────────────────────────────────────┤
│ Foundation Layer │
│ (LLM, embeddings, vector store, config) │
└─────────────────────────────────────────────┘
Let me walk through each layer, starting from the bottom.
Foundation Layer: Decisions That Are Hard to Reverse
Embedding model selection
Your embedding model is load-bearing infrastructure. Changing it later means re-embedding your entire corpus, which is expensive and risky (you'll have a period where old and new embeddings coexist). Pick deliberately.
What I've learned:
- Don't default to OpenAI's text-embedding-3-small unless you've benchmarked it against your domain. For technical documentation, text-embedding-3-large with dimensions=1024 often significantly outperforms the small variant on precision — and the cost difference is negligible at most scales.
- Test on your actual queries, not MTEB leaderboards. I've seen models that rank #3 on MTEB perform terribly on internal API documentation because the training data skew didn't match our domain.
- Normalize embeddings at index time. Cosine similarity on normalized vectors is equivalent to dot product, and dot product is faster. Most vector stores optimize for this.
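The normalization point is easy to verify for yourself: once vectors are unit length, dot product and cosine similarity give identical scores, so the index can use the cheaper operation. A minimal sketch in plain Python:

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# On normalized vectors, dot product and cosine similarity agree,
# so the index can rank by the faster dot-product distance.
a, b = normalize([3.0, 4.0]), normalize([1.0, 2.0])
assert abs(dot(a, b) - cosine(a, b)) < 1e-9
```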
Vector store selection
PostgreSQL with pgvector is the right default for most teams. You already run Postgres. You already have backup/restore. You already have connection pooling. You don't need a separate vector database until you're past ~10M vectors or need sub-10ms p99 retrieval.
When you do need to scale beyond pgvector:
- Qdrant if you need rich filtering + payload storage alongside vectors
- Weaviate if your team wants a managed, schema-enforced experience
- Pinecone if you want fully managed and don't mind vendor lock-in
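For the pgvector default, the query side is ordinary SQL. Here's a sketch of building a nearest-neighbor query, assuming a chunks table with an embedding vector column (table and column names are illustrative, not a fixed schema):

```python
def vector_literal(embedding):
    """Format a Python list as a pgvector input literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(x) for x in embedding) + "]"

def build_pgvector_query(table="chunks", top_k=10, filter_sql=None):
    """Build a nearest-neighbor query for pgvector. <#> is pgvector's
    negative-inner-product operator, so ascending ORDER BY returns the
    highest inner product (equal to cosine similarity on normalized
    vectors) first. Pass the embedding literal via the %(q)s parameter."""
    where = f"WHERE {filter_sql} " if filter_sql else ""
    return (
        f"SELECT id, content, metadata, embedding <#> %(q)s::vector AS distance "
        f"FROM {table} {where}"
        f"ORDER BY embedding <#> %(q)s::vector LIMIT {int(top_k)}"
    )
```

The filter_sql hook is what makes hybrid retrieval (vector + metadata filter) a one-liner later, without switching stores.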
Retrieval Layer: Where Most RAG Systems Actually Fail
The retrieval layer is where I've seen the most production incidents. Not because retrieval is hard to set up, but because it's hard to set up well.
Chunking strategy
Chunking is the most impactful and least discussed decision in RAG architecture.
The naive approach (fixed-size character splits) will fail you. Here's why: a 512-token chunk that splits a code example in half, or separates a heading from its content, produces embeddings that represent noise. Your retrieval will return fragments that the LLM can't reason about.
What works in practice:
- Semantic chunking with structural awareness. Parse your documents into an AST or structure tree first. For Markdown, this means splitting on headers while keeping the header attached to its content. For code, this means function-level or class-level boundaries.
- Overlapping windows with parent-child relationships. Chunk at ~400 tokens with ~50-token overlap. Store both the chunk and a reference to its parent section. When a chunk is retrieved, you can optionally expand to the parent for more context.
- Metadata-enriched chunks. Each chunk should carry: source document ID, section path (e.g., "API Reference > Authentication > OAuth2"), chunk index within the section, and any relevant tags. This metadata enables hybrid retrieval (vector + filter), which dramatically improves precision.
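A sketch of the structural approach for Markdown, splitting on headers while keeping each header attached to its body and recording the section path as metadata (token-budget sub-splitting and overlap are omitted for brevity):

```python
import re

def chunk_markdown(doc_id, text):
    """Split Markdown on headers. Each chunk keeps its header line and
    carries the full section path (e.g. 'API Reference > OAuth2')."""
    chunks, path, body, header = [], [], [], None

    def flush():
        # Emit the section accumulated so far, if any.
        if header is not None or body:
            chunks.append({
                "doc_id": doc_id,
                "section_path": " > ".join(path),
                "text": "\n".join(([header] if header else []) + body).strip(),
            })

    for line in text.splitlines():
        m = re.match(r"^(#+)\s+(.*)", line)
        if m:
            flush()
            level = len(m.group(1))
            # Trim the path back to the parent level, then append this header.
            path[:] = path[:level - 1] + [m.group(2)]
            header, body = line, []
        else:
            body.append(line)
    flush()
    return [c for c in chunks if c["text"]]
```

Parent-child expansion falls out of the metadata: a chunk's parent section is just the section_path minus its last element.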
The retrieval pipeline
Simple cosine similarity against your vector store is a starting point, not a solution. A production retrieval pipeline looks like this:
User Query
│
▼
Query Rewriting (LLM-based)
│ - Decompose complex queries into sub-queries
│ - Expand acronyms and domain terms
│ - Generate hypothetical answer (HyDE) for better embedding match
│
▼
Hybrid Retrieval
│ - Dense: vector similarity (top-50)
│ - Sparse: BM25 keyword match (top-50)
│ - Merge via Reciprocal Rank Fusion
│
▼
Reranking (cross-encoder)
│ - Score top candidates with a cross-encoder model
│ - Much more accurate than bi-encoder similarity alone
│ - Expensive — only run on merged candidate set
│
▼
Context Assembly
│ - Deduplicate overlapping chunks
│ - Order by relevance, then by document position
│ - Trim to context window budget
│
▼
Final Context (passed to LLM)
The query rewriting step alone typically improves retrieval precision by 15-25%. Users write ambiguous, incomplete queries. The LLM can reformulate "how do I fix the auth thing" into "How do I resolve OAuth2 token refresh failures in the authentication middleware?"
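The fusion step in the pipeline above is simpler than it sounds. A sketch of Reciprocal Rank Fusion over the dense and sparse result lists (k=60 is the constant from the original RRF paper; doc IDs are whatever your retrievers return):

```python
def reciprocal_rank_fusion(result_lists, k=60, top_n=50):
    """Merge ranked result lists: each document scores sum(1 / (k + rank))
    across every list it appears in, so items ranked well by both the
    dense and sparse retrievers rise to the top."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Note that RRF only needs ranks, not scores, which is exactly why it works for merging vector similarity and BM25: the two score scales are incomparable, but their rankings are not.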
Reranking: the most underused technique
Cross-encoder reranking is the single highest-ROI improvement you can add to a RAG system. Bi-encoder retrieval (your embedding similarity search) is fast but lossy — it compresses query and document into fixed-size vectors independently. A cross-encoder sees both together and can reason about their relationship.
I use Cohere's rerank-v3.5 or a self-hosted bge-reranker-v2-m3 depending on latency and cost constraints. The reranker runs on the top-50 to top-100 candidates from hybrid retrieval and returns a re-scored list. You take the top 5-10.
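The reranking stage itself is a thin wrapper; the scorer is the interesting part. A sketch with the scorer left pluggable — in production it might wrap a cross-encoder such as sentence-transformers' CrossEncoder loaded with bge-reranker-v2-m3, or a hosted rerank API; the word-overlap scorer in the usage note is purely illustrative:

```python
def rerank(query, candidates, score_fn, top_k=10):
    """Score each (query, candidate) pair jointly and keep the top_k.
    Unlike bi-encoder retrieval, score_fn sees query and document
    together, so it can reason about their relationship."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_k]
```

Usage with a toy scorer: rerank("oauth token refresh", docs, lambda q, d: len(set(q.split()) & set(d.split())), top_k=5).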
Tool Layer: MCP Integration
Model Context Protocol gives your agent a standardized way to discover and invoke external tools. The key insight is that MCP separates tool discovery from tool invocation, which means your agent can dynamically adapt to available capabilities.
MCP server architecture
In production, I run MCP servers as sidecar processes (or separate containers in Kubernetes). Each server exposes a focused set of tools:
Agent Process
│
├── MCP Server: "docs-search"
│ Tools: search_docs, get_doc_by_id, list_sections
│
├── MCP Server: "code-tools"
│ Tools: search_codebase, read_file, list_directory
│
└── MCP Server: "api-actions"
Tools: create_ticket, update_status, query_metrics
Each MCP server should be a bounded context. Don't create one server with 50 tools — the model's tool selection accuracy degrades as the tool count increases. Group related tools into focused servers.
Tool schema design
The quality of your tool schemas directly determines how well the LLM selects and parameterizes tool calls.
Principles that matter:
- Descriptions are prompts. Your tool description isn't documentation for humans — it's a prompt for the LLM. Be explicit about when to use the tool and what it returns.
- Constrain parameters. Use enums instead of free-form strings where possible. Specify formats. Include examples in the description.
- Return structured data. Don't return raw HTML or unstructured text from tools. Return JSON that the LLM can reason about. Include metadata like result count, confidence scores, and pagination tokens.
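Put together, a tool definition following these principles might look like this (the tool name, fields, and enum values are illustrative, not a fixed contract):

```python
# Illustrative MCP-style tool schema. The description is written as a prompt
# for the model; the enum and range constraints narrow its parameter choices.
SEARCH_DOCS_TOOL = {
    "name": "search_docs",
    "description": (
        "Search the product documentation. Use this when the user asks about "
        "features, configuration, or APIs. Returns a JSON list of matches, "
        "each with a title, section path, snippet, and relevance score."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Full-text search query, e.g. 'OAuth2 token refresh'",
            },
            "doc_type": {
                "type": "string",
                "enum": ["guide", "api_reference", "changelog"],
                "description": "Restrict results to one documentation type",
            },
            "limit": {"type": "integer", "minimum": 1, "maximum": 20, "default": 5},
        },
        "required": ["query"],
    },
}
```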
Handling MCP failures gracefully
MCP servers will go down. Tools will timeout. The agent needs to handle this without crashing or hallucinating.
My approach:
- Circuit breakers per MCP server. If a server fails 3 times in 60 seconds, mark it as unavailable and remove its tools from the agent's available set.
- Timeout budgets. Each tool call gets a timeout (typically 10-30 seconds). If it exceeds the timeout, return a structured error to the agent — don't let it hang.
- Fallback tool descriptions. When a tool is unavailable, inject a note into the system prompt: "The docs-search tools are currently unavailable. Answer based on your training data and note any uncertainty."
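The per-server circuit breaker is a few dozen lines. A minimal sketch of the 3-failures-in-60-seconds policy, with the clock injectable so it can be tested without sleeping:

```python
import time

class CircuitBreaker:
    """Per-MCP-server breaker: after max_failures failures within window
    seconds, the server is marked unavailable and its tools should be
    dropped from the agent's tool set until the window passes."""

    def __init__(self, max_failures=3, window=60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.window = window
        self.clock = clock  # injectable for testing
        self.failures = []

    def record_failure(self):
        self.failures.append(self.clock())

    def available(self):
        # Drop failures older than the window, then check the threshold.
        cutoff = self.clock() - self.window
        self.failures = [t for t in self.failures if t > cutoff]
        return len(self.failures) < self.max_failures
```

Before each agent turn, filter the advertised tool list to servers whose breaker reports available(), and inject the fallback note for the rest.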
Orchestration Layer: The Agent Loop
The orchestration layer ties retrieval and tools together into a coherent agent loop.
The basic loop
while not done:
1. Assess: What do I know? What do I need?
2. Plan: Which tool or retrieval action gets me closer?
3. Act: Execute the chosen action
4. Observe: Process the result
5. Reflect: Did this help? Do I need to adjust?
6. Respond or continue
The critical design decision is when to stop. Agents that loop indefinitely waste tokens and user patience. I enforce:
- Max iterations (typically 5-8 for complex queries)
- Token budget (track cumulative tokens and stop if approaching limits)
- Diminishing returns detection (if the last 2 retrieval actions returned content the agent already had, stop retrieving and synthesize)
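The three stop conditions compose into a small guard around the loop. A sketch, where assess_and_act stands in for the model-driven step (a hypothetical callable returning an action result, the tokens it consumed, and whether the agent is done):

```python
def run_agent(assess_and_act, max_iterations=8, token_budget=30_000):
    """Stop on: explicit done, iteration cap, token budget, or two
    consecutive results the agent has already seen (diminishing returns)."""
    seen, repeats, tokens = set(), 0, 0
    for _ in range(max_iterations):
        result, tokens_used, done = assess_and_act()
        tokens += tokens_used
        if done:
            return "answered"
        # Track consecutive already-seen results as a diminishing-returns signal.
        repeats = repeats + 1 if result in seen else 0
        seen.add(result)
        if repeats >= 2:
            return "diminishing_returns"
        if tokens >= token_budget:
            return "token_budget"
    return "max_iterations"
```

Whatever the stop reason, the agent should still synthesize an answer from what it has; the guard bounds cost, it doesn't abort the conversation.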
Memory management
For multi-turn conversations, you need a memory strategy:
- Short-term (conversation buffer): Last N messages, sliding window. Simple and sufficient for most cases.
- Working memory (scratchpad): Key facts extracted during the conversation. Persists across turns even as the conversation buffer slides.
- Long-term (retrieval-backed): For agents that serve repeat users, store summaries of past conversations and retrieve relevant ones.
The working memory scratchpad is the most underappreciated pattern. After each tool call or retrieval, have the agent extract and store key facts in a structured scratchpad. This prevents the agent from re-retrieving information it already found 3 turns ago.
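A minimal sketch of such a scratchpad (the topic/fact structure is one choice among many; the point is that the agent can cheaply check what it already knows before retrieving again):

```python
class Scratchpad:
    """Working memory: structured facts keyed by topic, persisting across
    turns even as the conversation buffer slides."""

    def __init__(self):
        self.facts = {}

    def remember(self, topic, fact):
        self.facts.setdefault(topic, set()).add(fact)

    def knows(self, topic):
        """Check before retrieving: do we already have facts on this topic?"""
        return topic in self.facts

    def render(self):
        """Format the scratchpad for injection into the system prompt."""
        lines = []
        for topic, facts in self.facts.items():
            lines.append(f"{topic}:")
            lines.extend(f"  - {f}" for f in sorted(facts))
        return "\n".join(lines)
```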
Pitfalls I've Hit in Production
1. Context window stuffing. Shoving all retrieved chunks into the prompt regardless of relevance. This actually degrades LLM performance — irrelevant context is worse than no context. Always rerank and trim.
2. Embedding drift. Your documents change over time but your embeddings don't auto-update. Build an incremental re-embedding pipeline that tracks document checksums and re-embeds on change.
3. Tool call loops. The agent calls a tool, gets a result, decides it needs more info, calls the same tool with slightly different params, gets a similar result, and loops. Break this with deduplication of tool call results and the max-iteration guard.
4. Retrieval–generation mismatch. Your retrieval returns relevant chunks, but the LLM ignores them and answers from parametric knowledge. This happens more often than you'd expect. Mitigation: use system prompt instructions that explicitly tell the model to ground its answers in the provided context and cite sources.
5. MCP schema version skew. Your agent expects tool v2 but the MCP server deployed tool v3 with breaking parameter changes. Always version your tool schemas and handle unknown parameters gracefully.
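The incremental re-embedding pipeline from pitfall 2 reduces to a checksum diff. A sketch, assuming documents and stored checksums are keyed by document ID:

```python
import hashlib

def plan_reembedding(documents, stored_checksums):
    """Compare each document's content hash to the checksum recorded at
    last index time; only changed or new docs need re-embedding.
    documents: doc_id -> content; stored_checksums: doc_id -> sha256 hex."""
    to_reembed, new_checksums = [], {}
    for doc_id, content in documents.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        new_checksums[doc_id] = digest
        if stored_checksums.get(doc_id) != digest:
            to_reembed.append(doc_id)
    return to_reembed, new_checksums
```

Run it on every ingest; persist new_checksums only after the corresponding embeddings have been written, so a failed run retries the same documents.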
Production Readiness Checklist
Before shipping a RAG + MCP agent:
- [ ] Retrieval precision measured on a golden test set (≥50 query–answer pairs from real users)
- [ ] End-to-end latency budget defined (p50, p95, p99 targets)
- [ ] Circuit breakers on all external calls (MCP servers, embedding API, reranker)
- [ ] Token usage tracking and alerting (cost can spike unexpectedly with agent loops)
- [ ] Evaluation pipeline that runs on every deployment (retrieval recall, answer quality, tool selection accuracy)
- [ ] Graceful degradation tested: what happens when vector store is down? When MCP server is down? When LLM is rate-limited?
- [ ] PII handling: ensure retrieved chunks don't leak sensitive data across tenant boundaries
- [ ] Observability: structured logging of every retrieval query, tool call, and LLM interaction with trace IDs
Closing Thoughts
The gap between a RAG demo and a production RAG agent is enormous. The demo is a weekend project. The production system is months of iteration on chunking strategies, retrieval pipelines, tool schemas, error handling, and evaluation methodology.
Start with the retrieval layer. Get your chunking and reranking right before adding tools. Measure retrieval precision before measuring end-to-end answer quality. And invest in evaluation infrastructure early — it's the only way to confidently iterate without regressing.
The MCP ecosystem is still maturing, but the protocol's separation of discovery and invocation is the right abstraction. Build your tool layer on it, and you'll be well-positioned as the ecosystem grows.