yabasha.dev
Bashar Ayyash (Yabasha)
AI Engineer & Full‑Stack Tech Lead • Amman, Jordan

Building Production AI Agents That Actually Work

Move beyond fragile AI demos with this guide to building production-ready agents. Master task decomposition, guardrails, and observability for reliable systems.

Bashar Ayyash • March 6, 2026 • 10 min read • 1,859 words

The cobbler's children have no shoes.

I've watched this play out dozens of times: a team ships a slick AI demo, investors nod approvingly, then the first real user hits an edge case and the whole thing collapses. Hallucinations multiply. Context windows explode. API costs spike 400% overnight. The demo looked brilliant; the production system is a liability.

This isn't a failure of LLM technology. It's a failure of architecture.

If you're building AI agents that need to survive contact with real users, real data, and real error conditions, you need more than a clever prompt. You need systems thinking. This is what I've learned from shipping agents that actually stay up.

The Problem: Why Most Agent Tutorials Are Toy Demos

Walk through any "Build an AI Agent in 10 Minutes" tutorial and you'll see the same pattern:

const response = await openai.chat.completions.create({
  model: "gpt-4",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: userInput }
  ],
  tools: [someTool],
});

That's not an agent. That's a function call with extra steps.

What breaks in production:

  • No retry logic. The LLM API times out, throws a 529, or returns malformed JSON. Your entire request chain dies.
  • No validation. The model outputs garbage, your downstream service accepts it, and now your database has inconsistent data.
  • No cost tracking. You're burning through tokens with zero visibility into which calls matter and which are waste.
  • No observability. Something failed three hours ago. Good luck finding it in your logs.
  • No fallbacks. The model hallucinated a tool call. The tool 500'd. Now what?

Production agents aren't about the happy path. They're about gracefully handling every possible failure mode while staying within budget.

The Architecture That Works

After several iterations of broken agents, I landed on a pattern that holds up: task decomposition + tool calling + guardrails + fallback chains.

Task Decomposition

Don't ask an LLM to do everything at once. Break the job into discrete, verifiable steps:

User Request → Intent Classification → Parameter Extraction → Tool Selection → Tool Execution → Response Synthesis

Each step is a separate LLM call with a focused prompt. This is slower than one megaprompt, but it's debuggable, testable, and you can cache intermediate results.
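The pipeline above can be sketched as small typed step functions composed in sequence. This is a hedged sketch: the step bodies are stubs standing in for real focused-prompt LLM calls, and names like `classifyIntent` are illustrative.

```typescript
// Each pipeline stage is a small, independently testable async function.
// The LLM calls are stubbed here; in production each stub wraps one focused prompt.
type Intent = 'order_status' | 'refund' | 'other';

interface PipelineContext {
  userMessage: string;
  intent?: Intent;
  params?: Record<string, unknown>;
  tool?: string;
}

type Step = (ctx: PipelineContext) => Promise<PipelineContext>;

// Stub: a real version would be a schema-enforced LLM call.
const classifyIntent: Step = async (ctx) => ({
  ...ctx,
  intent: ctx.userMessage.includes('refund') ? 'refund' : 'order_status',
});

const extractParams: Step = async (ctx) => ({
  ...ctx,
  params: { orderId: ctx.userMessage.match(/#(\d+)/)?.[1] },
});

const selectTool: Step = async (ctx) => ({
  ...ctx,
  tool: ctx.intent === 'refund' ? 'create_ticket' : 'search_orders',
});

// Compose the steps; each intermediate result is a natural point to log or cache.
async function runPipeline(userMessage: string, steps: Step[]): Promise<PipelineContext> {
  let ctx: PipelineContext = { userMessage };
  for (const step of steps) {
    ctx = await step(ctx); // a cache lookup keyed on (step, ctx) could go here
  }
  return ctx;
}
```

Because every stage takes and returns the same context shape, you can unit-test stages in isolation and swap one out without touching the rest.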

Tool Calling with Schema Enforcement

Vercel AI SDK's generateObject is your friend here. Define exactly what you expect:

import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

const ToolCallSchema = z.object({
  tool: z.enum(['search_orders', 'create_ticket', 'escalate']),
  params: z.record(z.unknown()),
  confidence: z.number().min(0).max(1),
});

const { object: toolCall } = await generateObject({
  model: openai('gpt-4'),
  schema: ToolCallSchema,
  prompt: classifyIntent(userMessage),
});

If the model can't produce JSON that matches the schema, generateObject throws a typed error you can catch and handle (and it retries transient API failures for you). No more parsing broken tool calls at 3 AM.

Guardrails: Input and Output Validation

Every agent call needs two checkpoints:

Input validation: Is this request safe to process? Rate limit check. PII scan. Token count validation. Malformed input rejection.

Output validation: Did the model return what we asked for? Schema match. Sanity checks (e.g., a "refund amount" shouldn't exceed the order total). Hallucination detection for specific fields.
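The refund sanity check mentioned above can be an ordinary predicate that compares the model's numbers against ground truth. A sketch with hypothetical types (`RefundOutput` and `Order` are illustrative, not part of any library):

```typescript
// Hypothetical shapes for a refund tool call and its source order.
interface RefundOutput { orderId: string; refundAmount: number; }
interface Order { id: string; total: number; }

// Output sanity check: schema-valid JSON can still be semantically wrong,
// so verify the model's values against real data before acting on them.
function validateRefund(output: RefundOutput, order: Order): void {
  if (output.orderId !== order.id) {
    throw new Error(`Refund targets order ${output.orderId}, expected ${order.id}`);
  }
  if (output.refundAmount <= 0 || output.refundAmount > order.total) {
    throw new Error(`Refund ${output.refundAmount} outside valid range (0, ${order.total}]`);
  }
}
```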

I wrap every LLM call in a guardrail function:

async function withGuardrails<TIn, TOut>(
  input: TIn,
  call: (input: TIn) => Promise<TOut>,
  validateInput: (input: TIn) => void,
  validateOutput: (output: TOut) => void,
  fallback: () => TOut
): Promise<TOut> {
  try {
    validateInput(input);
    const result = await call(input);
    validateOutput(result);
    return result;
  } catch (error) {
    // Log to Langfuse, trigger alert, return fallback
    return fallback();
  }
}

Fallback Chains

When the primary model fails, have a Plan B:

  1. Retry with the same model (transient errors)
  2. Fallback to a cheaper/smaller model (cost control)
  3. Use cached results (if available and fresh enough)
  4. Return a graceful degradation response ("I can help with that, but I'll need more information")
  5. Escalate to human (critical failures)

The key is deciding your fallback strategy before you need it, not during an incident.
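The five-step chain above can be expressed as an ordered list of strategies tried in sequence; a strategy that doesn't apply (cache miss, model down) throws to pass control to the next one. A minimal sketch, with the strategy names left to the caller:

```typescript
// A fallback chain: try each strategy in order until one succeeds.
type Strategy<T> = () => Promise<T>;

async function withFallbacks<T>(strategies: Strategy<T>[]): Promise<T> {
  let lastError: unknown;
  for (const strategy of strategies) {
    try {
      return await strategy();
    } catch (error) {
      lastError = error; // record and fall through to the next strategy
    }
  }
  // Every strategy failed: surface the last error for alerting/escalation.
  throw lastError ?? new Error('No fallback strategies provided');
}
```

Usage mirrors the ordered list: `withFallbacks([retryPrimary, callCheaperModel, readCache, gracefulReply])`, where the last entry never throws. Declaring the chain as data makes the "decide before the incident" rule enforceable in code review.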

Concrete Patterns for Production

Here are the specific patterns I use in every production agent system:

Retry with Exponential Backoff

LLM APIs fail. Rate limits hit. Network blips. Handle it:

import { retry } from 'ts-retry-promise';

const result = await retry(
  () => generateObject({ /* ... */ }),
  {
    retries: 3,
    delay: 1000,
    backoff: 'EXPONENTIAL',
    retryIf: (error) => 
      error.status === 429 || 
      error.status >= 500 ||
      error.code === 'ETIMEDOUT'
  }
);

Don't retry 4xx errors (that's your bug), but do retry rate limits and server errors.

Cost Tracking Per Agent Call

Every single LLM call should log:

  • Input tokens
  • Output tokens
  • Total cost (calculated)
  • Model used
  • Call duration
  • Success/failure status

Langfuse handles this automatically, but you need to instrument it:

import { langfuse } from '@yabasha/cas/observability';

const trace = langfuse.trace({ name: 'order-assistant', userId: request.userId });
const generation = trace.generation({
  name: 'classify-intent',
  model: 'gpt-4',
  input: userMessage,
});

try {
  const result = await classifyIntent(userMessage);
  generation.end({ output: result, usage: result.usage });
} catch (error) {
  generation.end({ level: 'ERROR', statusMessage: error.message });
  throw error;
}

Now you can answer: "We spent $47 on LLM calls yesterday. $38 of that came from the order-assistant agent. $12 of that was from retry loops."
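The cost column itself is just per-token arithmetic over the logged usage. A sketch with illustrative prices (per-model rates change often; treat the table as a placeholder you'd keep in config):

```typescript
// Illustrative per-1M-token prices; real rates vary by model and change over time.
const PRICE_PER_MILLION: Record<string, { input: number; output: number }> = {
  'gpt-4o': { input: 2.5, output: 10 },
  'gpt-4o-mini': { input: 0.15, output: 0.6 },
};

interface CallUsage { model: string; inputTokens: number; outputTokens: number; }

// Dollar cost of a single logged call.
function callCost({ model, inputTokens, outputTokens }: CallUsage): number {
  const price = PRICE_PER_MILLION[model];
  if (!price) throw new Error(`No price configured for model: ${model}`);
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// Aggregate cost per agent from a batch of logged calls — the query behind
// "which agent spent what yesterday?".
function costByAgent(calls: Array<CallUsage & { agent: string }>): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const call of calls) {
    totals[call.agent] = (totals[call.agent] ?? 0) + callCost(call);
  }
  return totals;
}
```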

Response Caching

Many agent calls are deterministic given the same input. Cache them:

const cacheKey = hash({ prompt, model, temperature: 0 }); // deterministic only
const cached = await cache.get(cacheKey);
if (cached) return cached;

const result = await generateObject({ /* ... */ });
await cache.set(cacheKey, result, { ttl: '1h' });

Use temperature=0 for cacheable calls, temperature>0 for creative/generative tasks only.
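The `cache` and `hash` in the snippet above are placeholders. A minimal in-memory version looks like the following; in production you'd back the store with Redis/Upstash and hash the key with something like SHA-256 instead of raw JSON:

```typescript
// Minimal TTL cache keyed by a deterministic string. Single-process only;
// a real deployment would keep this state in Redis/Upstash.
interface Entry<T> { value: T; expiresAt: number; }

class TtlCache<T> {
  private store = new Map<string, Entry<T>>();

  // Injectable clock makes expiry testable without sleeping.
  constructor(private now: () => number = Date.now) {}

  get(key: string): T | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.store.delete(key); // lazily evict stale entries
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T, ttlMs: number): void {
    this.store.set(key, { value, expiresAt: this.now() + ttlMs });
  }
}

// Deterministic cache key: stringify with sorted keys so property order
// doesn't change the key. Note: the replacer list also filters nested keys,
// so this sketch is only safe for flat parameter objects.
function cacheKey(params: Record<string, unknown>): string {
  return JSON.stringify(params, Object.keys(params).sort());
}
```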

Observability: You Can't Fix What You Can't See

Production agents are distributed systems. Every call is a potential failure point. You need:

Langfuse Tracing

Langfuse gives you distributed traces across your agent execution. You can see:

  • The full execution graph of an agent run
  • Token usage and cost at every step
  • Latency breakdown (where is time actually spent?)
  • Input/output for debugging

// In your Convex action
export const runAgent = action({
  args: { conversationId: v.id('conversations') },
  handler: async (ctx, args) => {
    const trace = langfuse.trace({ 
      name: 'customer-support-agent',
      metadata: { conversationId: args.conversationId }
    });
    
    // Each step gets its own span
    const intentSpan = trace.span({ name: 'classify-intent' });
    const intent = await classifyIntent(ctx, args);
    intentSpan.end();
    
    const toolSpan = trace.span({ name: 'execute-tools' });
    const result = await executeTools(ctx, intent);
    toolSpan.end();
    
    trace.update({ status: 'success' });
    return result;
  },
});

Eval Loops

You need automated evaluation running continuously. Not just "did it work?" but "was it good?"

I run three types of evals:

  1. Deterministic checks: Did the output match the schema? Did it contain required fields?
  2. LLM-as-judge: Use a cheap model (like gpt-4o-mini) to grade outputs against criteria
  3. Human spot-checks: Sample 1% of production traffic for manual review

// In packages/evals/src/evaluators.ts
export const intentAccuracyEval = defineEval({
  name: 'intent-classification-accuracy',
  dataset: 'intent-classification-test-set',
  evaluator: async ({ input, expected }) => {
    const result = await classifyIntent(input);
    const { object: judge } = await generateObject({
      model: openai('gpt-4o-mini'),
      schema: z.object({ correct: z.boolean(), reason: z.string() }),
      prompt: `Did the model classify "${result.intent}" correctly? Expected: ${expected}`,
    });
    return { score: judge.correct ? 1 : 0, reason: judge.reason };
  },
});

Run these in CI on every push, and nightly against production samples.

Cost Attribution

Know which features cost what:

// Tag every trace with the feature that triggered it
langfuse.trace({
  name: 'support-agent',
  tags: ['feature:order-lookup', 'environment:production'],
});

Now you can query: "How much did the order-lookup feature cost in March?" This is essential for pricing decisions and capacity planning.

The Agent Loop Pattern

After trying various architectures, I settled on a consistent loop pattern:

spawn → execute → validate → retry or escalate

Spawn: Create a trace/span for this agent run. Load context (conversation history, user profile). Set guardrails.

Execute: Run the core agent logic (intent classification, tool calls, response generation). This is where the LLM calls happen.

Validate: Check the output against schema, business rules, and safety constraints. Is this safe to return to the user? Is it accurate?

Retry or Escalate:

  • If validation fails and we have retries left: fix the input (add clarification), retry
  • If validation fails and we're out of retries: return a graceful fallback response
  • If this is a critical failure: escalate to human and alert

This loop runs for every user request. It's not elegant, but it's reliable.
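Stripped to its shape, the loop looks like this. A hedged sketch: `execute` and `validate` stand in for the real LLM call and checks, and the spawn step (tracing, context loading) is elided:

```typescript
// The spawn → execute → validate → retry-or-escalate loop, reduced to its control flow.
interface LoopResult<T> { status: 'ok' | 'fallback' | 'escalated'; value?: T; }

async function agentLoop<T>(
  execute: (attempt: number) => Promise<T>,   // the LLM call(s); attempt lets you add clarification
  validate: (output: T) => boolean,           // schema + business rules + safety
  opts: { maxRetries: number; fallback: () => T; critical?: (e: unknown) => boolean }
): Promise<LoopResult<T>> {
  for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
    try {
      const output = await execute(attempt);                         // execute
      if (validate(output)) return { status: 'ok', value: output };  // validate
      // Validation failed with retries left: loop again.
    } catch (error) {
      if (opts.critical?.(error)) return { status: 'escalated' };    // alert a human
    }
  }
  // Out of retries: graceful fallback rather than an error page.
  return { status: 'fallback', value: opts.fallback() };
}
```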

Real Deployment Considerations

Timeouts

LLM calls are slow. Set aggressive timeouts:

// In Convex actions
export const agentAction = action({
  args: { message: v.string() },
  returns: v.string(),
  handler: async (ctx, args) => {
    // Convex has a 30s action timeout
    // We budget 25s for LLM calls, 5s for overhead
    const timeoutPromise = new Promise((_, reject) => 
      setTimeout(() => reject(new Error('AGENT_TIMEOUT')), 25000)
    );
    
    return Promise.race([
      runAgent(args.message),
      timeoutPromise
    ]);
  },
});

Have a fallback ready when the timeout hits. Don't leave users hanging.

Rate Limits

Every LLM provider has different rate limits. Track your usage:

// Simple token bucket in Redis/Upstash Redis
const rateLimiter = new TokenBucket({
  name: 'openai-requests',
  capacity: 100,  // requests per minute
  refillRate: 100/60,  // per second
});

if (!(await rateLimiter.take())) {
  // Queue for later or fallback to cached/cheaper model
  await queueForLater(request);
  return { status: 'queued', message: 'Processing in background' };
}

Graceful Degradation

When everything fails, what does the user see?

Have a dead-simple fallback agent:

import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

export const fallbackAgent = async (userMessage: string) => {
  // Uses no tools, just responds politely
  return generateText({
    model: openai('gpt-3.5-turbo'),  // cheap, fast
    prompt: `We're experiencing technical difficulties. Respond helpfully to: "${userMessage}" without making commitments.`,
  });
};

This keeps the conversation going while you fix the underlying issue.

What CAS Provides Out of the Box

The Composable AI Stack (CAS) encodes all these patterns so you don't have to rebuild them:

  • Task Decomposition: packages/agents with composable step functions
  • Tool Calling: Vercel AI SDK integration with Zod schemas
  • Guardrails: @yabasha/cas/guardrails with input/output validators
  • Retry Logic: built into the LLM client layer
  • Cost Tracking: Langfuse integration with per-call attribution
  • Observability: full tracing + eval harness in packages/evals
  • Agent Loop: AgentRunner class implementing spawn → execute → validate → retry
  • Timeouts: Convex action-level + client-level timeout handling
  • Fallback Chains: configurable fallback strategies per agent

Initialize a new project:

bun add -g @yabasha/cas
cas init my-agent-system
cd my-agent-system
bun install
cas dev

You get a working agent system with all the production patterns already wired up. The examples include a customer support agent, a data analysis agent, and a code review agent — each showing different patterns for decomposition, tool use, and error handling.

TL;DR: Production Agent Checklist

  • Decompose tasks into verifiable steps
  • Use Zod schemas for all LLM outputs
  • Wrap every call in guardrails (input + output validation)
  • Implement retry with exponential backoff
  • Track cost per call, per feature, per user
  • Use Langfuse for distributed tracing
  • Run automated evals on every change
  • Set aggressive timeouts with fallbacks
  • Handle rate limits explicitly
  • Plan graceful degradation for total failure
  • Document your fallback strategy before you need it

What's Next

Agent systems are still maturing. The tools are getting better (Vercel AI SDK, Langfuse, Convex), but the hard part remains architecture and observability. The teams that win won't be the ones with the cleverest prompts. They'll be the ones with the most reliable systems.

If you're building agents for production, start with the failure modes. Design your system to degrade gracefully. Instrument everything. And for the love of all that is holy, don't copy the 10-minute tutorial code into your production repo.

Questions? I'm @yabasha on Twitter/X, or reach out through yabasha.dev.

