RAG Systems Explained: When to Use RAG vs Fine-Tuning for Business AI
A pragmatic guide to RAG vs fine-tuning for business AI: decision criteria, architecture, cost/latency trade-offs, a comparison table, and production checklists.
Why RAG (and why most chatbots fail)
Most “AI chatbots” fail because they:
- answer without grounding (hallucinations)
- lack access to your docs / tickets / database
- can’t cite sources or explain the reasoning path
- break under real-world constraints (latency, cost, privacy, access control)
RAG (Retrieval-Augmented Generation) solves this by retrieving relevant, up-to-date context and generating answers grounded in your data.
If you’re evaluating what to build, start here: Services (or Contact if you want to discuss your case).
The minimal RAG architecture
At a high level:
- Ingest: pull documents from sources (Drive, Confluence, Notion, DB exports, PDFs)
- Chunk: split into retrievable units
- Embed + index: store vectors + metadata
- Retrieve: hybrid search (semantic + keyword)
- Re-rank: improve relevance (cross-encoder or LLM-based)
- Generate: answer with citations + guardrails
Chunking: the highest-leverage decision
Chunking is a product decision as much as a technical one.
Good chunking rules of thumb:
- chunk by structure (headings, sections, Q/A), not by raw character count
- keep metadata (doc title, section path, timestamps, ACL)
- store source URLs so you can cite and deep-link
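As a rough illustration of those rules, here is a minimal sketch of structure-based chunking for Markdown input. The `Chunk` shape, the `chunkByHeadings` name, and the field names are assumptions for illustration, not a prescribed schema; timestamps and ACL fields would be attached the same way.

```typescript
// Hypothetical chunk shape; field names are illustrative.
interface Chunk {
  content: string;
  docTitle: string;
  sectionPath: string[]; // heading trail, e.g. ["Billing", "Refunds"]
  sourceUrl: string; // kept so answers can cite and deep-link
}

// Split a Markdown document on its headings rather than on character count,
// recording the heading trail for every chunk.
function chunkByHeadings(markdown: string, docTitle: string, sourceUrl: string): Chunk[] {
  const chunks: Chunk[] = [];
  const path: string[] = [];
  let buffer: string[] = [];

  const flush = (): void => {
    const content = buffer.join("\n").trim();
    if (content) {
      chunks.push({ content, docTitle, sectionPath: [...path], sourceUrl });
    }
    buffer = [];
  };

  for (const line of markdown.split("\n")) {
    const heading = /^(#{1,6})\s+(.*)$/.exec(line);
    if (heading) {
      flush(); // close the previous section before the trail changes
      const [, hashes = "", title = ""] = heading;
      path.splice(hashes.length - 1); // drop this level and anything deeper
      path.push(title.trim());
    } else {
      buffer.push(line);
    }
  }
  flush();
  return chunks;
}
```

Very long sections would still need a secondary split (for example by paragraph), which this sketch leaves out.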
Retrieval patterns that hold up in production
Hybrid search
Use a blend of:
- BM25 / keyword search (great for exact phrases, IDs)
- vector search (great for meaning)
Re-ranking
Re-ranking is how you turn “okay” retrieval into “trustworthy” retrieval.
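A minimal sketch of the re-ranking step, assuming a pluggable scoring function. The `rerank` and `overlapScore` names are hypothetical, and the toy term-overlap scorer only stands in for a real cross-encoder or LLM call, which would normally be asynchronous.

```typescript
interface Candidate {
  id: string;
  content: string;
}

// A scorer rates how well a passage answers the query (higher is better).
type Scorer = (query: string, passage: string) => number;

// Score every retrieved candidate against the query and keep the best topN.
function rerank(
  query: string,
  candidates: Candidate[],
  score: Scorer,
  topN = 5
): (Candidate & { relevance: number })[] {
  return candidates
    .map((c) => ({ ...c, relevance: score(query, c.content) }))
    .sort((a, b) => b.relevance - a.relevance)
    .slice(0, topN);
}

// Toy scorer: fraction of query terms that appear in the passage.
// Placeholder only; swap in a cross-encoder for real use.
const overlapScore: Scorer = (query, passage) => {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  const text = passage.toLowerCase();
  const hits = terms.filter((t) => text.includes(t)).length;
  return terms.length > 0 ? hits / terms.length : 0;
};
```

The usual pattern is to over-fetch from retrieval (say 15-20 candidates), re-rank, and pass only the top few to the LLM.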
Guardrails: make answers safer and more useful
- Citations: show which sources were used
- Refusal mode: “I don’t know based on available docs”
- Follow-up questions: ask for missing context
- Access control: enforce per-user permissions at retrieval time
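One way to make the citation guardrail checkable is to number the sources in the prompt and then verify that every marker in the answer points at a real source. The sketch below assumes a `[n]` citation format; `formatSources` and `invalidCitations` are hypothetical helper names.

```typescript
interface Source {
  title: string;
  url: string;
}

// Number the sources so the model can cite them as [1], [2], ...
function formatSources(sources: Source[]): string {
  return sources.map((s, i) => `[${i + 1}] ${s.title} (${s.url})`).join("\n");
}

// Return the cited numbers that do not match any provided source.
function invalidCitations(answer: string, sourceCount: number): number[] {
  const cited = new Set<number>();
  const pattern = /\[(\d+)\]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    cited.add(Number(match[1]));
  }
  return Array.from(cited)
    .filter((n) => n < 1 || n > sourceCount)
    .sort((a, b) => a - b);
}
```

An answer with no markers at all, or with invalid ones, can be routed to refusal mode instead of being shown to the user.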
Evaluation (how to know it works)
Measure:
- retrieval quality (recall@k, MRR)
- answer correctness (graded tests)
- citation accuracy
- latency + cost
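The two retrieval metrics are simple enough to compute by hand. A minimal sketch, assuming each test query comes with a set of chunk IDs known to be relevant and that retrieved IDs are unique (function names are illustrative):

```typescript
// Recall@k: share of the relevant chunk IDs that appear in the top k results.
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}

// Reciprocal rank: 1 / position of the first relevant result, or 0 if none.
function reciprocalRank(retrieved: string[], relevant: Set<string>): number {
  const index = retrieved.findIndex((id) => relevant.has(id));
  return index === -1 ? 0 : 1 / (index + 1);
}

// MRR: mean of the reciprocal ranks across all test queries.
function meanReciprocalRank(
  runs: { retrieved: string[]; relevant: Set<string> }[]
): number {
  if (runs.length === 0) return 0;
  const total = runs.reduce(
    (sum, run) => sum + reciprocalRank(run.retrieved, run.relevant),
    0
  );
  return total / runs.length;
}
```

Recall@k tells you whether the right chunks are reachable at all; MRR tells you how close to the top they land.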
Cost & latency trade-offs
RAG isn't free. Every query touches an embedding model, a vector DB, a re-ranker, and an LLM. Here's what that looks like in practice:
| Component | Latency (p50) | Cost per 1k queries | Notes |
|---|---|---|---|
| Embedding (text-embedding-3-small) | 50-100ms | ~$0.02 | Batch embed at ingest; cache query embeddings |
| Vector search (pgvector / Pinecone) | 10-50ms | ~$0.01 | Scales with index size; use HNSW for >100k chunks |
| Re-ranking (cross-encoder) | 100-300ms | ~$0.05-0.15 | Optional but dramatically improves precision |
| LLM generation (GPT-4o-mini) | 500-1500ms | ~$0.15-0.30 | The dominant cost; tune max_tokens ruthlessly |
| Total (typical) | 700-2000ms | ~$0.23-0.48 | Without caching or optimization |
Cost levers:
- Cache aggressively: cache query embeddings with a 1h TTL. In many workloads, 30-60% of queries are repeats.
- Use smaller models for routing: a cheap classifier can determine if a query needs RAG or can be answered from cache.
- Tune chunk size: smaller chunks give more precise retrieval, but each carries less surrounding context, so you often end up passing more of them to the LLM. 256-512 tokens is a reasonable starting point for most use cases.
- Stream responses: don't make the user wait for the full answer. Stream tokens as they arrive.
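The caching lever can be sketched with a small in-memory TTL cache. This is an illustration only: `TtlCache` and `normalizeQuery` are hypothetical names, and a production setup would more likely use a shared store such as Redis so that every instance sees the same entries.

```typescript
// Normalize queries so trivial variations (case, spacing) share a cache entry.
function normalizeQuery(query: string): string {
  return query.trim().toLowerCase().replace(/\s+/g, " ");
}

// Minimal in-memory cache with per-entry expiry.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  // The clock is injectable so expiry can be tested without waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: evict lazily on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Example: query embeddings cached for one hour.
const embeddingCache = new TtlCache<number[]>(60 * 60 * 1000);
```

Look up `embeddingCache.get(normalizeQuery(q))` before calling the embedding model, and `set` the result on a miss.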
Common RAG failure modes (and how to fix them)
1. "The answer is wrong but sounds confident"
Cause: Retrieval returned irrelevant chunks, but the LLM hallucinated a plausible-sounding answer anyway.
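Fix: Add a relevance threshold. If the top chunk's similarity score is below the threshold (0.7 cosine similarity is a common starting point; calibrate it for your embedding model), return "I don't have enough information" instead of generating. This is non-negotiable for production.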
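A minimal sketch of that gate, assuming retrieval returns a cosine similarity per chunk. The `gateOnRelevance` name and the result shape are illustrative; dropping the below-threshold chunks from the context is an extra design choice, not something the rule requires.

```typescript
interface ScoredChunk {
  id: string;
  content: string;
  similarity: number; // assumed to be cosine similarity, higher is better
}

type GateResult =
  | { action: "generate"; context: ScoredChunk[] }
  | { action: "refuse"; message: string };

// Decide whether to call the LLM at all. If no chunk clears the threshold,
// the top chunk is below it too, so refuse instead of generating.
function gateOnRelevance(chunks: ScoredChunk[], threshold = 0.7): GateResult {
  const relevant = chunks.filter((c) => c.similarity >= threshold);
  if (relevant.length === 0) {
    return { action: "refuse", message: "I don't have enough information" };
  }
  // Assumption: only chunks that clear the threshold are sent as context.
  return { action: "generate", context: relevant };
}
```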
2. "It works in testing but breaks in production"
Cause: Test queries closely matched the wording of the indexed documents. Production queries are messier: typos, slang, mixed languages, ambiguous references.
Fix: Build a golden test set from real production queries (sanitized). Run it on every pipeline change. Track retrieval quality (recall@k) separately from answer quality.
3. "Latency is fine until we hit 100k documents"
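Cause: Exact (brute-force) vector search scans every row, so latency grows linearly with index size, and IVF-style indexes can slow down or lose recall unless their list count is retuned as the data grows. HNSW indexes maintain ~O(log n) query time.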
Fix: Switch to HNSW. If using pgvector, set m=16, ef_construction=64 as a starting point. For >1M chunks, consider Pinecone or Weaviate.
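For pgvector, the index definition could be generated in a migration along these lines. This is a sketch: `hnswIndexSql` and the index naming scheme are hypothetical, and `vector_cosine_ops` assumes you query with the cosine distance operator (`<=>`).

```typescript
interface HnswOptions {
  table: string;
  column: string;
  m?: number; // max connections per graph node
  efConstruction?: number; // candidate list size while building the index
}

// Build the CREATE INDEX statement for a pgvector HNSW index.
// DDL cannot take bind parameters, so inputs are validated before interpolation.
function hnswIndexSql(opts: HnswOptions): string {
  const { table, column, m = 16, efConstruction = 64 } = opts;
  const identifier = /^[a-z_][a-z0-9_]*$/;
  if (!identifier.test(table) || !identifier.test(column)) {
    throw new Error("table and column must be plain lowercase identifiers");
  }
  if (!Number.isInteger(m) || !Number.isInteger(efConstruction) || m < 1 || efConstruction < 1) {
    throw new Error("m and efConstruction must be positive integers");
  }
  return (
    `CREATE INDEX IF NOT EXISTS ${table}_${column}_hnsw ` +
    `ON ${table} USING hnsw (${column} vector_cosine_ops) ` +
    `WITH (m = ${m}, ef_construction = ${efConstruction});`
  );
}
```

Query-time recall versus speed is then tuned separately through pgvector's `hnsw.ef_search` setting.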
4. "Users can see documents they shouldn't"
Cause: Vector search returns chunks by similarity, ignoring access control. A shared index means everyone sees everything.
Fix: Store ACL metadata in each chunk. Filter at retrieval time — never at generation time. If using pgvector, add a WHERE user_id = $1 OR doc_acl @> ARRAY[$1] clause to the vector search query.
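The query-time filter is the actual control. As a tripwire on top of it, the same predicate can be re-checked in application code before chunks reach the prompt, so a forgotten WHERE clause fails loudly instead of leaking. A sketch with hypothetical names (`AclChunk`, `canRead`, `assertAllReadable`) that mirrors the SQL condition above:

```typescript
interface AclChunk {
  id: string;
  content: string;
  ownerId: string; // mirrors user_id in the SQL predicate
  acl: string[]; // mirrors doc_acl
}

// Same rule as the SQL filter: the owner, or anyone listed in the ACL.
function canRead(userId: string, chunk: AclChunk): boolean {
  return chunk.ownerId === userId || chunk.acl.includes(userId);
}

// Defense in depth: this should never throw if the query-time filter works.
// It does not replace that filter.
function assertAllReadable(userId: string, chunks: AclChunk[]): AclChunk[] {
  const leaked = chunks.filter((c) => !canRead(userId, c));
  if (leaked.length > 0) {
    throw new Error(
      `ACL filter missing: ${leaked.length} chunk(s) not readable by ${userId}`
    );
  }
  return chunks;
}
```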
5. "The same question gets different answers"
Cause: No caching layer. Each query hits the full pipeline, and LLM temperature introduces variance.
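Fix: Cache the final answer (not just the embedding) with a 1-6h TTL. For questions that should have one stable answer (e.g., "what's our refund policy?"), set temperature to 0. That reduces variance but does not guarantee identical outputs, so the cache is what actually keeps answers consistent.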
Production checklist
Before shipping a RAG system to production, verify:
Retrieval
- HNSW index configured (not brute-force)
- Hybrid search (BM25 + vector) enabled
- Re-ranking on top-k candidates (cross-encoder or LLM-based)
- Relevance threshold enforced (refuse if below 0.7)
- Access control filters applied at query time
Generation
- Citations included in every response
- Refusal mode working ("I don't know based on available docs")
- Max tokens capped (prevent runaway responses)
- Temperature set appropriately (0 for factual, 0.3 for conversational)
- Content filtering on both input and output
Infrastructure
- Query embedding cache (Redis, 1h TTL)
- Response cache for deterministic queries
- Sentry / error tracking on the full pipeline
- Latency budget per stage (alert if p95 > 2x baseline)
- Cost monitoring and alerts (per 1k queries)
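The latency budget check can be sketched as a nearest-rank p95 per stage compared against a stored baseline. The names and the shape of `StageStats` are assumptions; in practice the samples would come from your metrics backend.

```typescript
// Nearest-rank percentile over a list of latency samples (milliseconds).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[Math.max(0, rank - 1)] as number;
}

interface StageStats {
  samplesMs: number[];
  baselineP95Ms: number;
}

// Return the stages whose current p95 exceeds factor x their baseline p95.
function stagesOverBudget(stages: Record<string, StageStats>, factor = 2): string[] {
  return Object.entries(stages)
    .filter(
      ([, s]) =>
        s.samplesMs.length > 0 &&
        percentile(s.samplesMs, 95) > factor * s.baselineP95Ms
    )
    .map(([name]) => name);
}
```

Tracking each stage separately is what lets you tell a slow re-ranker from a slow LLM when the total budget is blown.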
Evaluation
- Golden test set (20+ queries with expected answers)
- Retrieval metrics tracked (recall@k, MRR)
- Answer quality graded (human or LLM-as-judge)
- Citation accuracy measured
- Regression tests run on every pipeline change
Code example: hybrid search with pgvector
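    // Hybrid search: keyword (PostgreSQL full-text search) + vector similarity.
    // Requires the pgvector extension and a tsvector column (tsv).
    // Assumes `sql` is a tagged-template Postgres client that supports nested
    // fragments (for example postgres.js) and that reciprocalRankFusion is
    // defined elsewhere in the codebase.
    async function hybridSearch(
      query: string,
      queryEmbedding: number[],
      opts: { limit?: number; userId?: string } = {}
    ) {
      const { limit = 5, userId } = opts;
      // pgvector accepts the text form "[0.1,0.2,...]", which JSON.stringify produces
      const embedding = JSON.stringify(queryEmbedding);
      // ACL is assumed to be a JSON array of user IDs under metadata.acl
      const acl = userId ? JSON.stringify([userId]) : null;

      // Vector search. <=> is cosine distance, so 1 - distance is cosine similarity.
      const vectorQuery = sql`
        SELECT id, content, metadata,
               1 - (embedding <=> ${embedding}::vector) AS similarity
        FROM chunks
        ${acl ? sql`WHERE metadata->'acl' @> ${acl}::jsonb` : sql``}
        ORDER BY embedding <=> ${embedding}::vector
        LIMIT ${limit * 3}
      `;

      // Keyword search via PostgreSQL full-text search. ts_rank_cd is a
      // cover-density ranking, not true BM25, but it fills the same role here.
      const keywordQuery = sql`
        SELECT id, content, metadata,
               ts_rank_cd(tsv, plainto_tsquery(${query})) AS rank
        FROM chunks
        WHERE tsv @@ plainto_tsquery(${query})
        ${acl ? sql`AND metadata->'acl' @> ${acl}::jsonb` : sql``}
        ORDER BY rank DESC
        LIMIT ${limit * 3}
      `;

      // Run both searches in parallel, over-fetching 3x so fusion has candidates
      const [vectorResults, keywordResults] = await Promise.all([
        vectorQuery,
        keywordQuery,
      ]);

      // Fuse results with weighted reciprocal rank fusion (RRF)
      const fused = reciprocalRankFusion(vectorResults, keywordResults, {
        k: 60, // RRF constant
        weights: { vector: 0.7, keyword: 0.3 },
      });

      return fused.slice(0, limit);
    }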
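The example calls a `reciprocalRankFusion` helper that is not shown. One possible implementation, matching the call signature used above, is sketched here; the 0.5/0.5 default weights are an assumption. Each list contributes weight / (k + rank) for every row it contains, and rows found by both searches accumulate both contributions.

```typescript
// Weighted reciprocal rank fusion over two ranked result lists.
// Inputs must already be sorted best-first; rows are matched by id.
function reciprocalRankFusion<T extends { id: string | number }>(
  vectorResults: T[],
  keywordResults: T[],
  opts: { k?: number; weights?: { vector: number; keyword: number } } = {}
): (T & { score: number })[] {
  const { k = 60, weights = { vector: 0.5, keyword: 0.5 } } = opts;
  const fused = new Map<T["id"], { row: T; score: number }>();

  const accumulate = (rows: T[], weight: number): void => {
    rows.forEach((row, index) => {
      const contribution = weight / (k + index + 1); // ranks are 1-based
      const existing = fused.get(row.id);
      if (existing) {
        existing.score += contribution;
      } else {
        fused.set(row.id, { row, score: contribution });
      }
    });
  };

  accumulate(vectorResults, weights.vector);
  accumulate(keywordResults, weights.keyword);

  return Array.from(fused.values())
    .sort((a, b) => b.score - a.score)
    .map(({ row, score }) => ({ ...row, score }));
}
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that cosine similarity and ts_rank_cd values are not on comparable scales.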
Next steps
- If your backend is Laravel, see: Laravel AI Integration (RAG + Agents)
- If you're exploring agent orchestration, see: AI Agent Frameworks (2026)
- Starting a new project? Scaffold an AI-ready monorepo with @yabasha/cas, the CLI I use to bootstrap retrieval and agent services from a clean baseline.
Want a fast, pragmatic assessment of your data + use case? Let's talk.
Want this implemented end-to-end?
If you want a production-grade RAG assistant or agentic workflow, with proper evaluation, access control, and observability, let's scope it.