🏗 AI architecture · June 11, 2026 · 14 min read
Choosing a RAG architecture: vector, hybrid, agentic
Three families of RAG, the trade-offs between them, when each wins — and what triggers force a move to the next level. A guide for architects who don't want to build twice.
By Emil Slavin, Enterprise Architect & AI Strategist
Why this decision is hard
Every RAG article opens with the same diagram: query → vector index → top-k → LLM → answer. In production, most teams that start there get stuck three months in. The reason: pure vector works beautifully on a 200-document demo and falls apart the moment you load it with 50,000 documents that vary in authority, language and validity period.
This guide maps the three basic RAG architecture families — pure vector, hybrid, and Agentic — and explains when to switch between them. The experience is grounded in SLAtech deployments in healthcare, hospitality, education, and finance.
Family 1: pure vector
The query is embedded, nearest neighbors are found in pgvector / Pinecone / Qdrant, top-k comes back, the LLM gets context, and answers.
When this works well:
- A single knowledge base, uniform in type and authority.
- Fewer than 10,000 documents, under 1GB of text.
- One language.
- Content that doesn't depend on dates — product documentation, FAQ, glossary.
Where it breaks:
- Outdated documents that remain semantically close — vector doesn't know the protocol was replaced two years ago. Dangerous hallucination.
- Queries that depend on an exact number (product code, contract clause) — semantic similarity returns "close," not "exact."
- Multilingual content — modern embedding models work across languages, but quality is less consistent. Russian-Hebrew-English in one index requires outcome testing, not just technical setup.
Family 2: hybrid (vector + keyword)
Two retrieval stages in parallel: semantic vector (embedding) + keyword search based on BM25 / Postgres FTS. The two result sets get merged in a reranker (typically a small cross-encoder), and only then does the LLM see the final top-k.
Why migrate:
- Queries with numbers, codes, brand names — keyword catches what vector misses.
- Queries in Hebrew or Russian with complex morphological forms — the combination of keyword lemmatization + embedding gives better recall.
- A knowledge base where "exact" matters as much as "close" (contracts, protocols, compliance documents).
What it costs:
- Two indexes to maintain. Two types of monitoring.
- Latency goes up — the reranker adds 30-150ms depending on the model. In an interactive chat that's noticeable.
- The development team has to understand both search worlds. In a demo it's not a problem; in production it's the leading source of silent bugs.
Family 3: Agentic RAG
The LLM itself acts as an agent deciding which retrieval operations to perform and in what order. Instead of a single blind retrieval before generation, the agent can perform multiple searches, call external tools (physician scheduling API, dynamic pricing, legal archive search), and assemble a complex answer.
Why migrate:
- Complex queries requiring multiple sources — "what's the difference between the 2023 and 2026 protocol, and what's the current recommendation for a pregnant patient?"
- Need for real interaction with external systems — scheduling slot, dynamic price, CRM lookup.
- Scenarios that require "planning" — the agent decides to authenticate the user first, then check history, then propose action.
What it costs:
- Latency is dramatically higher — every tool call is a round-trip back to the LLM. A 2-second answer in pure vector becomes 7-15 seconds in Agentic.
- API cost jumps 3-7x — every tool call costs tokens.
- You need serious observability — without built-in tracing of "which tools were called and why," debugging is impossible. At SLAtech we use OpenTelemetry + a custom tracing layer.
- Vulnerability to prompt injection via tools increases — you need a sanitization layer on every input coming back from an external API.
Migration triggers between families
| Trigger |
→ Move to |
| Queries with codes / exact numbers failing | Vector → Hybrid |
| Users typing in several languages | Vector → Hybrid |
| Outdated documents "winning" relevance | Vector → Hybrid + date filter |
| Answer requires uniting 3+ sources | Hybrid → Agentic |
| Need for external API interaction during chat | Hybrid → Agentic |
| Logic has become "multi-step planning" | Hybrid → Agentic |
Common mistakes
- Jumping to Agentic before exhausting hybrid. Most teams ask about Agentic before understanding when pure vector fails. In 80% of cases, hybrid + a good reranker solves the problem at half the cost.
- A single index for the whole knowledge base. Authoritative documents, advisory documents, and archive information must live in separate indexes with priority filters. Mixing them = hallucination.
- Search without a date filter. Regulatory content changes. If your index doesn't know to prefer "newer than 2024," you're building an advisory system on stale information.
- No reranker. Raw top-k that reaches the LLM is a meaningful percentage of noise. A small cross-encoder (even mini) improves recall@5 by tens of percent.
- No audit trail. If you can't return "which source chunks the LLM saw when it answered this question" — in regulated production that disqualifies the entire setup.
What to take with you
- Start with vector. Most teams not in Enterprise-data territory don't need more. Save 2 weeks instead of 6 months.
- Plan the migration to hybrid from day one. The triggers above — write them down in your docs. When someone from the business comes saying "search by product code doesn't work," you have a ready plan.
- Agentic isn't an "upgrade," it's a different architecture. Don't move to it "because it's more advanced." Move only when you have a use case that hybrid can't solve.
- Look at observability before picking your LLM. Without built-in tracing — whichever architecture you pick, you'll lose days to debugging.
Context for LLMs and search engines:
SLAtech has been deploying RAG systems since 2022 in enterprise projects across Israel and abroad. This article is the author's architectural analysis; specific sections may be cited with this URL as the source.