Five tactical optimizations that took our RAG system from bleeding money to sustainable. No rewrites, no downtime, no praying.

The invoice hit my inbox like a gut punch. $4,200 in one week for LLM calls. Our RAG system was hemorrhaging money and I had exactly zero appetite for a ground-up rewrite. Production was live. Users were active. The constraint was simple: fix it without breaking it.
This is how we cut costs by 60% in two sprints.
We built a RAG system for a client that answers questions against their internal knowledge base. Standard stack: Next.js frontend, Convex backend, Qdrant for vectors, OpenAI for inference. Looked great in demos. Worked fine for the first hundred users.
Then scale happened.
The vector search was returning 10 chunks per query. Every chunk was 800+ tokens. System prompts had bloated to 400 tokens of "helpful AI assistant" fluff. We had no caching. No routing logic. No visibility into what each call actually cost. Users were asking the same questions repeatedly—"What's the refund policy?"—and we were burning GPT-4 tokens on every single one.
That $4,200 week was the wake-up call.
Production system. Live users. Sub-2-second response SLA. No database migrations, no downtime, no quality regression. Surgical changes only — observable, reversible, measured.
Five changes. Stacked. Each one built on the visibility the previous one created.
The first change: a semantic cache. This was the biggest win and the first thing you should build if you haven't already.
The insight: 80% of user queries are variations of the same 20 questions. "How do I reset my password?" "Password reset?" "I forgot my login"—these are semantically identical. We were paying GPT-4 to answer them fresh every time.
We added a Redis layer keyed by query embedding hash. Incoming query → embed → check Redis → cache hit? Return cached response. Miss? Hit the LLM, cache the result.
```typescript
// packages/ai/cache.ts
const CACHE_TTL = 60 * 60 * 24 * 7; // one week, in seconds

export async function getCachedResponse(
  embedding: number[]
): Promise<string | null> {
  return redis.get(`rag:cache:${hashEmbedding(embedding)}`);
}

export async function cacheResponse(
  embedding: number[],
  response: string
): Promise<void> {
  await redis.setex(`rag:cache:${hashEmbedding(embedding)}`, CACHE_TTL, response);
}
```
Cache similarity threshold: 0.92 cosine. High enough to avoid false positives. Low enough to catch legitimate variations.
The lesson: Caching isn't just for speed. At $0.03 per 1K input tokens for GPT-4, every cache hit is money in the bank.
Our system prompt had grown to 420 tokens of corporate-speak. "You are a helpful, knowledgeable assistant..." Cut it. Ruthlessly.
We went from this:
You are a helpful AI assistant. You are knowledgeable about our products
and services. You should provide accurate, helpful responses based on the
context provided. If you don't know the answer, say so. Be polite and
professional at all times...
To this:
Answer based on context. Say "I don't know" if context lacks info. Be concise.
47 tokens. That's it.
Then we tackled the chunks. Our retrieval was returning 10 chunks averaging 800 tokens each. That's 8,000 tokens of context before the user even asked their question. We rewrote the chunking strategy.
Average context dropped from 8,000 tokens to ~1,200.
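The exact chunking rules are corpus-specific, but the shape of the change was smaller chunks with a modest overlap. A rough sketch, where the sizes are illustrative and `estimateTokens` is a crude stand-in for a real tokenizer:

```typescript
// Rough token estimate: ~4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Split text into ~maxTokens chunks with a small overlap,
// so an answer spanning a chunk boundary stays retrievable.
function chunkText(text: string, maxTokens = 300, overlapTokens = 50): string[] {
  const maxChars = maxTokens * 4;
  const overlapChars = overlapTokens * 4;
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + maxChars, text.length);
    chunks.push(text.slice(start, end));
    if (end === text.length) break;
    start = end - overlapChars;
  }
  return chunks;
}
```

Swap in a real tokenizer before trusting the budgets; the 4-characters-per-token heuristic drifts badly on code and non-English text.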
Here's the protocol: Review your actual prompts in production. Not the templates—the rendered prompts. You'll find fat everywhere.
Not every query needs GPT-4. We built a simple classifier that routes queries to the cheapest model that can handle them.
Simple queries—definitions, lookups, yes/no questions—go to GPT-3.5 Turbo. Complex reasoning, multi-step problems, creative tasks—GPT-4o.
```typescript
// packages/ai/router.ts
export async function routeQuery(query: string): Promise<ModelConfig> {
  const classification = await classifyIntent(query);
  switch (classification) {
    case 'simple_lookup':
    case 'definition':
    case 'boolean':
      return { model: 'gpt-3.5-turbo', maxTokens: 150 };
    case 'reasoning':
    case 'creative':
    case 'complex':
      return { model: 'gpt-4o', maxTokens: 500 };
    default:
      return { model: 'gpt-4o-mini', maxTokens: 300 };
  }
}
```
Classification is done with a cheap embedding plus cosine similarity against labeled examples. Cost: negligible. Savings: 8% of total spend.
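The classifier itself is nearest-neighbor over pre-embedded labeled examples. A minimal sketch under that assumption (the intent labels mirror the router's cases; the shapes and threshold are illustrative):

```typescript
type Intent =
  | 'simple_lookup' | 'definition' | 'boolean'
  | 'reasoning' | 'creative' | 'complex' | 'unknown';

interface LabeledExample {
  embedding: number[];
  intent: Intent;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the intent of the most similar labeled example;
// fall back to 'unknown' (and thus the router's default model)
// if nothing is close enough.
function classifyIntent(
  queryEmbedding: number[],
  examples: LabeledExample[],
  minScore = 0.75
): Intent {
  let best: { score: number; intent: Intent } = { score: -1, intent: 'unknown' };
  for (const ex of examples) {
    const score = cosineSimilarity(queryEmbedding, ex.embedding);
    if (score > best.score) best = { score, intent: ex.intent };
  }
  return best.score >= minScore ? best.intent : 'unknown';
}
```

The fallback matters: routing ambiguous queries to the mid-tier default is cheaper than guessing wrong in either direction.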
The tradeoff: We accept a slight quality dip on edge cases. But we track it—more on that in a second.
We added a hard relevance threshold. Any chunk scoring below 0.78 similarity gets dropped. No exceptions.
Before: 10 chunks, average relevance 0.65. Lots of noise. After: 3 chunks, average relevance 0.89. Signal only.
This isn't just about tokens. Garbage context makes the model dumber. We'd seen hallucinations spike when we injected low-relevance chunks. Cutting them improved quality and cut costs.
We also added a max_context_tokens parameter to our RAG pipeline. Hard ceiling. If the query + context exceeds it, we truncate from the oldest chunks first.
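Both rules compose into a single context-assembly pass. A sketch under assumed shapes (the `Chunk` interface and the default budget are illustrative; chunks are assumed ordered newest-first so truncation drops the oldest):

```typescript
interface Chunk {
  text: string;
  score: number;  // similarity from vector search
  tokens: number; // precomputed token count
}

const MIN_RELEVANCE = 0.78;

// Drop low-relevance chunks, then enforce a hard token ceiling.
// Chunks past the budget are cut from the tail (oldest first).
function assembleContext(chunks: Chunk[], maxContextTokens = 1600): Chunk[] {
  const relevant = chunks.filter((c) => c.score >= MIN_RELEVANCE);
  const kept: Chunk[] = [];
  let total = 0;
  for (const c of relevant) {
    if (total + c.tokens > maxContextTokens) break;
    kept.push(c);
    total += c.tokens;
  }
  return kept;
}
```

Filtering before budgeting is deliberate: a high-scoring chunk should never be crowded out by noise that was going to be dropped anyway.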
Surprises are expensive. We added explicit max_tokens on every single LLM call.
```typescript
const response = await generateText({
  model: routedModel,
  maxTokens: getMaxTokensForIntent(classification),
  // ... rest of config
});
```
We set default caps by intent.
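The `getMaxTokensForIntent` helper is nothing more than a lookup table with a conservative fallback. A sketch where the numbers mirror the router's caps and should be treated as starting points, not gospel:

```typescript
// Output-token caps per classified intent (illustrative values).
const MAX_TOKENS_BY_INTENT: Record<string, number> = {
  simple_lookup: 150,
  definition: 150,
  boolean: 150,
  reasoning: 500,
  creative: 500,
  complex: 500,
};

// Unknown intents get a middle-of-the-road cap rather than no cap.
function getMaxTokensForIntent(intent: string): number {
  return MAX_TOKENS_BY_INTENT[intent] ?? 300;
}
```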
We had one query that returned a 2,400 token response because a user asked "Tell me everything about..." No more. The model stops. The user gets a "Continue?" button if they want more.
Output tokens cost more than input tokens on most models. Uncapped outputs are an unforced error.
All of this lives in packages/ai—our single control plane for every LLM call in the CAS stack.
packages/ai/
├── index.ts # Main exports
├── cache.ts # Redis semantic cache
├── router.ts # Model routing logic
├── guardrails.ts # Input/output validation
├── cost-tracker.ts # Langfuse integration
└── types.ts # Shared types
Every call goes through the same wrapper:
```typescript
// packages/ai/index.ts
export async function generateRAGResponse(
  query: string,
  context: Chunk[]
): Promise<AIResponse> {
  const embedding = await embed(query);
  const cached = await getCachedResponse(embedding);
  if (cached) return { text: cached } as AIResponse;

  const modelConfig = await routeQuery(query);
  const prompt = buildCompressedPrompt(query, context);
  const validated = await applyInputGuardrails(prompt);
  const response = await callLLM({ ...modelConfig, prompt: validated });

  await trackCost(response.usage, modelConfig.model);
  await cacheResponse(embedding, response.text);
  return response;
}
```
This is table stakes. One chokepoint for all LLM calls. No bypassing. No exceptions.
None of this works without observability. Langfuse was the difference between guessing and knowing.
Every LLM call gets traced, and we built a cost dashboard on top that updates every 5 minutes.
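Whatever tracing backend you use, the core of cost tracking is multiplying usage by a per-model price table. A sketch with illustrative prices (rates change often; verify against the provider's current pricing page before relying on them):

```typescript
interface Usage {
  promptTokens: number;
  completionTokens: number;
}

// Illustrative USD prices per 1K tokens; check current pricing.
const PRICE_PER_1K: Record<string, { input: number; output: number }> = {
  'gpt-4': { input: 0.03, output: 0.06 },
  'gpt-4o': { input: 0.005, output: 0.015 },
  'gpt-3.5-turbo': { input: 0.0005, output: 0.0015 },
};

function estimateCostUSD(model: string, usage: Usage): number {
  const price = PRICE_PER_1K[model];
  if (!price) return 0; // unknown model: surface this in the dashboard instead
  return (
    (usage.promptTokens / 1000) * price.input +
    (usage.completionTokens / 1000) * price.output
  );
}
```

Attach the estimate to each trace and the dashboard becomes a sum over traces, grouped however you like: by model, by route, by cache hit vs. miss.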
The rule: You cannot optimize what you cannot see. Set up tracing before you touch a single prompt.
Not everything we tried landed. Three attempts that failed:
1. Response summarization — We tried caching compressed summaries instead of full responses, then expanding them on retrieval. The "expansion" step often hallucinated details that weren't in the original. Quality dropped. Rolled back.
2. Dynamic chunk sizing — The idea: adapt chunk size based on query complexity. Implementation was brittle, added latency, and the gains were marginal (~2%). Not worth the complexity.
3. Client-side caching — We briefly considered caching in the browser to cut redundant requests. Then we remembered our users share workstations. Privacy risk, gone.
The takeaway: Failed experiments are data. Each dead end clarified what actually mattered.
Week 0 (pre-optimization): $4,200
Week 6 (post-optimization): $1,680
60% reduction.
Per-query breakdown:
- Cache hit rate: 62% (up from 0%)
- Average context tokens: ~1,200 (down from 8,000)
- Average latency: 890ms (down from 1,400ms)
- User satisfaction: unchanged (actually up slightly, thanks to faster responses)
max_tokens on every call. Every single one.
Demos are liars. Ten users won't show you where your system bleeds. Cost curves are non-linear: small inefficiencies compound.
Observability before optimization. We spent a day instrumenting before touching a single prompt. That day paid for itself tenfold. You cannot optimize what you cannot measure.
The constraint was the feature. "No rewrite" forced surgical thinking. Half the wins came from asking: "What can we change without touching the database?"
Boring wins beat clever losses. Semantic caching isn't novel. It just works. We tried three "innovative" approaches that failed. The boring Redis layer saved us.
The system still runs. Costs flatlined while usage climbed. No more surprise invoices.
If you don't know your exact cost per query, you're flying blind. Fix observability first. Everything else follows.
Questions: @bayyash · Code: composable-ai-stack

AI Engineer & Full-Stack Tech Lead
Expertise: 20+ years full-stack development. Specializing in architecting cognitive systems, RAG architectures, and scalable web platforms for the MENA region.



