
Cutting LLM Costs 60%: A Production RAG Post-Mortem

Five tactical optimizations that took our RAG system from bleeding money to sustainable. No rewrites, no downtime, no praying.

Bashar Ayyash · March 5, 2026 · 8 min read · 1,503 words

The invoice hit my inbox like a gut punch. $4,200 in one week for LLM calls. Our RAG system was hemorrhaging money and I had exactly zero appetite for a ground-up rewrite. Production was live. Users were active. The constraint was simple: fix it without breaking it.

This is how we cut costs by 60% in two sprints.


The Problem:

We built a RAG system for a client that answers questions against their internal knowledge base. Standard stack: Next.js frontend, Convex backend, Qdrant for vectors, OpenAI for inference. Looked great in demos. Worked fine for the first hundred users.

Then scale happened.

The vector search was returning 10 chunks per query. Every chunk was 800+ tokens. System prompts had bloated to 400 tokens of "helpful AI assistant" fluff. We had no caching. No routing logic. No visibility into what each call actually cost. Users were asking the same questions repeatedly—"What's the refund policy?"—and we were burning GPT-4 tokens on every single one.

That $4,200 week was the wake-up call.


The Constraints:

Production system. Live users. Sub-2-second response SLA. No database migrations, no downtime, no quality regression. Surgical changes only — observable, reversible, measured.


The Blueprint:

Five changes. Stacked. Each one built on the visibility the previous one created.

1. Semantic Caching with Redis (~35% reduction)

This was the biggest win and the first thing you should build if you haven't already.

The insight: 80% of user queries are variations of the same 20 questions. "How do I reset my password?" "Password reset?" "I forgot my login"—these are semantically identical. We were paying GPT-4 to answer them fresh every time.

We added a Redis layer keyed by query embedding hash. Incoming query → embed → check Redis → cache hit? Return cached response. Miss? Hit the LLM, cache the result.

// packages/ai/cache.ts
const CACHE_TTL = 60 * 60 * 24 * 7; // 7 days, in seconds

export async function getCachedResponse(
  embedding: number[]
): Promise<string | null> {
  // The hash lookup catches exact repeats; near-duplicate queries are
  // matched by a cosine-similarity check against stored embeddings
  // before falling through to the LLM.
  return redis.get(`rag:cache:${hashEmbedding(embedding)}`);
}

export async function cacheResponse(
  embedding: number[],
  response: string
): Promise<void> {
  await redis.setex(`rag:cache:${hashEmbedding(embedding)}`, CACHE_TTL, response);
}

Cache similarity threshold: 0.92 cosine. High enough to avoid false positives. Low enough to catch legitimate variations.
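The snippet above only shows the exact-hash fast path. The 0.92-cosine match can be sketched as a scan over cached entries — a hypothetical helper, not the production code, which would use a vector index rather than a linear scan:

```typescript
// Hypothetical sketch of the similarity-based cache lookup.
type CacheEntry = { embedding: number[]; response: string };

const SIMILARITY_THRESHOLD = 0.92;

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function findCachedResponse(
  query: number[],
  entries: CacheEntry[]
): string | null {
  let best: CacheEntry | null = null;
  let bestScore = SIMILARITY_THRESHOLD; // anything below threshold is a miss
  for (const entry of entries) {
    const score = cosine(query, entry.embedding);
    if (score >= bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best ? best.response : null;
}
```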

The lesson: Caching isn't just for speed. At GPT-4's $0.03 per 1K input tokens ($0.06 per 1K output), every cache hit is money in the bank.


2. Prompt Compression (~15% reduction)

Our system prompt had grown to 420 tokens of corporate-speak. "You are a helpful, knowledgeable assistant..." Cut it. Ruthlessly.

We went from this:

You are a helpful AI assistant. You are knowledgeable about our products and services. You should provide accurate, helpful responses based on the context provided. If you don't know the answer, say so. Be polite and professional at all times...

To this:

Answer based on context. Say "I don't know" if context lacks info. Be concise.

47 tokens. That's it.

Then we tackled the chunks. Our retrieval was returning 10 chunks averaging 800 tokens each. That's 8,000 tokens of context before the user even asked their question. We rewrote the chunking strategy:

  • Smaller chunks: 512 tokens max, 50 token overlap
  • Relevance threshold: Only inject chunks scoring >0.78 similarity
  • Hard cap: Maximum 3 chunks per query unless explicitly overridden

Average context dropped from 8,000 tokens to ~1,200.
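The chunk filter reduces to a few lines. A minimal sketch, assuming retrieval returns chunks with similarity scores attached:

```typescript
// Sketch: keep only high-relevance chunks, hard-capped at three.
type ScoredChunk = { text: string; score: number };

const RELEVANCE_THRESHOLD = 0.78;
const MAX_CHUNKS = 3;

function selectChunks(chunks: ScoredChunk[]): ScoredChunk[] {
  return chunks
    .filter((c) => c.score > RELEVANCE_THRESHOLD) // drop the noise
    .sort((a, b) => b.score - a.score)            // best evidence first
    .slice(0, MAX_CHUNKS);                        // hard cap
}
```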

Here's the protocol: Review your actual prompts in production. Not the templates—the rendered prompts. You'll find fat everywhere.


3. Model Routing (~8% reduction)

Not every query needs GPT-4. We built a simple classifier that routes queries to the cheapest model that can handle them.

Simple queries—definitions, lookups, yes/no questions—go to GPT-3.5 Turbo. Complex reasoning, multi-step problems, creative tasks—GPT-4o.

// packages/ai/router.ts
export async function routeQuery(query: string): Promise<ModelConfig> {
  const classification = await classifyIntent(query);
  
  switch (classification) {
    case 'simple_lookup':
    case 'definition':
    case 'boolean':
      return { model: 'gpt-3.5-turbo', maxTokens: 150 };
    case 'reasoning':
    case 'creative':
    case 'complex':
      return { model: 'gpt-4o', maxTokens: 500 };
    default:
      return { model: 'gpt-4o-mini', maxTokens: 300 };
  }
}

Classification is done with a cheap embedding + cosine similarity against labeled examples. Cost: negligible. Savings: 8% of total spend.
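The post doesn't show classifyIntent's body, but nearest-example matching over embeddings is one cheap way to implement it — a sketch with hypothetical names:

```typescript
// Hypothetical sketch: classify a query by its nearest labeled example.
type LabeledExample = { embedding: number[]; intent: string };

function dot(a: number[], b: number[]): number {
  return a.reduce((s, v, i) => s + v * b[i], 0);
}

function cosineSim(a: number[], b: number[]): number {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

function classifyIntentSketch(
  query: number[],
  examples: LabeledExample[]
): string {
  // Return the intent of the most similar labeled example.
  let best = examples[0];
  for (const ex of examples) {
    if (cosineSim(query, ex.embedding) > cosineSim(query, best.embedding)) {
      best = ex;
    }
  }
  return best.intent;
}
```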

The tradeoff: We accept a slight quality dip on edge cases. But we track it—more on that in a second.


4. Context Window Discipline (~4% reduction)

We added a hard relevance threshold. Any chunk scoring below 0.78 similarity gets dropped. No exceptions.

Before: 10 chunks, average relevance 0.65. Lots of noise. After: 3 chunks, average relevance 0.89. Signal only.

This isn't just about tokens. Garbage context makes the model dumber. We'd seen hallucinations spike when we injected low-relevance chunks. Cutting them improved quality and cut costs.

We also added a max_context_tokens parameter to our RAG pipeline. Hard ceiling. If the query + context exceeds it, we truncate from the oldest chunks first.
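That ceiling can be sketched like this, assuming chunks arrive ordered oldest-first; countTokens is a stand-in for a real tokenizer like tiktoken:

```typescript
// Sketch: enforce a hard context-token ceiling, dropping oldest chunks first.
function countTokens(text: string): number {
  return Math.ceil(text.length / 4); // crude ~4-chars-per-token approximation
}

function enforceContextBudget(
  chunks: string[], // ordered oldest-first
  queryTokens: number,
  maxContextTokens: number
): string[] {
  const kept = [...chunks];
  let total = queryTokens + kept.reduce((s, c) => s + countTokens(c), 0);
  while (kept.length > 0 && total > maxContextTokens) {
    const dropped = kept.shift()!; // truncate from the oldest chunk
    total -= countTokens(dropped);
  }
  return kept;
}
```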


5. Output Length Caps (~3% reduction)

Surprises are expensive. We added explicit max_tokens on every single LLM call.

const response = await generateText({
  model: routedModel,
  maxTokens: getMaxTokensForIntent(classification),
  // ... rest of config
});

Default caps by intent:

  • Lookup/definition: 150 tokens
  • Explanation: 300 tokens
  • Complex reasoning: 500 tokens
  • Creative: 800 tokens (explicit override required)
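The getMaxTokensForIntent helper from the earlier snippet reduces to a lookup table over these caps. The post doesn't show its body; this reconstructs it from the defaults above, with an assumed fallback for unknown intents:

```typescript
// Caps by intent, mirroring the defaults above.
const MAX_TOKENS_BY_INTENT: Record<string, number> = {
  simple_lookup: 150,
  definition: 150,
  boolean: 150,
  explanation: 300,
  reasoning: 500,
  complex: 500,
  creative: 800, // explicit override required upstream
};

function getMaxTokensForIntent(intent: string): number {
  // Fallback value is an assumption; the post doesn't state a default.
  return MAX_TOKENS_BY_INTENT[intent] ?? 300;
}
```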

We had one query that returned a 2,400 token response because a user asked "Tell me everything about..." No more. The model stops. The user gets a "Continue?" button if they want more.

Output tokens cost more than input tokens on most models. Uncapped outputs are an unforced error.


Implementation: The Guardrail Pattern

All of this lives in packages/ai—our single control plane for every LLM call in the CAS stack.

packages/ai/
├── index.ts         # Main exports
├── cache.ts         # Redis semantic cache
├── router.ts        # Model routing logic
├── guardrails.ts    # Input/output validation
├── cost-tracker.ts  # Langfuse integration
└── types.ts         # Shared types

Every call goes through the same wrapper:

// packages/ai/index.ts
export async function generateRAGResponse(
  query: string,
  context: Chunk[]
): Promise<AIResponse> {
  const embedding = await embed(query);
  
  const cached = await getCachedResponse(embedding);
  if (cached) return { text: cached, cached: true } as AIResponse; // rehydrate the cached string
  
  const modelConfig = await routeQuery(query);
  const prompt = buildCompressedPrompt(query, context);
  const validated = await applyInputGuardrails(prompt);
  
  const response = await callLLM({ ...modelConfig, prompt: validated });
  
  await trackCost(response.usage, modelConfig.model);
  await cacheResponse(embedding, response.text);
  
  return response;
}

This is table stakes. One chokepoint for all LLM calls. No bypassing. No exceptions.


The Visibility Layer:

None of this works without observability. Langfuse was the difference between guessing and knowing.

Every LLM call gets traced:

  • Input tokens, output tokens, total cost
  • Model used, latency, cache hit/miss
  • User ID, session ID, query classification
  • Rendered prompt (for debugging compression)

We built a dashboard that updates every 5 minutes:

  • Cost per query (trending down)
  • Cache hit rate (target: >40%, now at 62%)
  • Model distribution (% on GPT-3.5 vs GPT-4)
  • Latency percentiles (p50, p95, p99)

The rule: You cannot optimize what you cannot see. Set up tracing before you touch a single prompt.
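The per-call cost math itself is simple once usage is traced. A sketch with illustrative prices only — rate cards drift, so pull current numbers from your provider:

```typescript
// Sketch: compute per-call cost from token usage and a price table.
// Prices below are illustrative placeholders, not current OpenAI rates.
type Usage = { promptTokens: number; completionTokens: number };

const PRICE_PER_1K: Record<string, { input: number; output: number }> = {
  "gpt-4o": { input: 0.005, output: 0.015 },
  "gpt-3.5-turbo": { input: 0.0005, output: 0.0015 },
};

function callCostUSD(model: string, usage: Usage): number {
  const p = PRICE_PER_1K[model];
  if (!p) throw new Error(`no price entry for ${model}`);
  return (
    (usage.promptTokens / 1000) * p.input +
    (usage.completionTokens / 1000) * p.output
  );
}
```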


What Didn't Work:

Not everything we tried landed. Three attempts that failed:

1. Response summarization — We tried caching compressed summaries instead of full responses, then expanding them on retrieval. The "expansion" step often hallucinated details that weren't in the original. Quality dropped. Rolled back.

2. Dynamic chunk sizing — The idea: adapt chunk size based on query complexity. Implementation was brittle, added latency, and the gains were marginal (~2%). Not worth the complexity.

3. Client-side caching — We briefly considered caching in the browser to cut redundant requests. Then we remembered our users share workstations. Privacy risk, gone.

The takeaway: Failed experiments are data. Each dead end clarified what actually mattered.


Results:

Week 0 (pre-optimization): $4,200

Week 6 (post-optimization): $1,680

60% reduction.

Per-query breakdown:

  • Before: $0.08 average
  • After: $0.032 average

  • Cache hit rate: 62% (up from 0%)
  • Average context tokens: 1,200 (down from 8,000)
  • Average latency: 890ms (down from 1,400ms)
  • User satisfaction: unchanged (actually up slightly thanks to faster responses)


TL;DR:

  • Semantic caching is free money. 62% of queries never hit the LLM. Redis pays for itself in a day.
  • Prompt compression is underrated. Cut 300+ tokens of fluff from your system prompts. Review the rendered prompts, not templates.
  • Model routing works. GPT-3.5 handles 40% of queries just fine. Don't pay for reasoning when all you need is lookup.
  • Relevance thresholds cut noise. Low-relevance chunks don't help—they hurt. Drop them aggressively.
  • Caps prevent surprises. max_tokens on every call. Every single one.
  • Visibility is non-negotiable. Use Langfuse, Braintrust, or whatever—but track actual costs per query before you optimize.
  • Guardrail pattern = control plane. One wrapper. All LLM calls. No bypass.

Lessons Learned:

Demos are liars. Ten users won't show you where your system bleeds. Cost curves are non-linear; small inefficiencies compound fast at scale.

Observability before optimization. We spent a day instrumenting before touching a single prompt. That day paid for itself tenfold. You cannot optimize what you cannot measure.

The constraint was the feature. "No rewrite" forced surgical thinking. Half the wins came from asking: "What can we change without touching the database?"

Boring wins beat clever losses. Semantic caching isn't novel. It just works. We tried three "innovative" approaches that failed. The boring Redis layer saved us.


The system still runs. Costs flatlined while usage climbed. No more surprise invoices.

If you don't know your exact cost per query, you're flying blind. Fix observability first. Everything else follows.


Questions: @bayyash · Code: composable-ai-stack

Tagged with:
#LLM#RAG#Cost Optimization#Vercel AI SDK#Langfuse#Redis#Production#AI Engineering
AUTHOR

Bashar Ayyash (Yabasha)

AI Engineer & Full-Stack Tech Lead

Expertise: 20+ years of full-stack development, specializing in architecting cognitive systems, RAG architectures, and scalable web platforms for the MENA region.
