yabasha.dev
Bashar Ayyash (Yabasha)
AI Engineer & Full‑Stack Tech Lead • Amman, Jordan

Building Production AI Agents That Actually Work

Move beyond fragile AI demos with this guide to building production-ready agents. Master task decomposition, guardrails, and observability for reliable systems.

Bashar Ayyash • March 6, 2026 • 10 min read • 1,859 words

The cobbler's children have no shoes.

I've watched this play out dozens of times: a team ships a slick AI demo, investors nod approvingly, then the first real user hits an edge case and the whole thing collapses. Hallucinations multiply. Context windows explode. API costs spike 400% overnight. The demo looked brilliant; the production system is a liability.

This isn't a failure of LLM technology. It's a failure of architecture.

If you're building AI agents that need to survive contact with real users, real data, and real error conditions, you need more than a clever prompt. You need systems thinking. This is what I've learned from shipping agents that actually stay up.

The Problem: Why Most Agent Tutorials Are Toy Demos

Walk through any "Build an AI Agent in 10 Minutes" tutorial and you'll see the same pattern:

const response = await openai.chat.completions.create({
  model: "gpt-4",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: userInput }
  ],
  tools: [someTool],
});

That's not an agent. That's a function call with extra steps.

What breaks in production:

  • No retry logic. The LLM API times out, throws a 529, or returns malformed JSON. Your entire request chain dies.
  • No validation. The model outputs garbage, your downstream service accepts it, and now your database has inconsistent data.
  • No cost tracking. You're burning through tokens with zero visibility into which calls matter and which are waste.
  • No observability. Something failed three hours ago. Good luck finding it in your logs.
  • No fallbacks. The model hallucinated a tool call. The tool 500'd. Now what?

Production agents aren't about the happy path. They're about gracefully handling every possible failure mode while staying within budget.

The Architecture That Works

After several iterations of broken agents, I landed on a pattern that holds up: task decomposition + tool calling + guardrails + fallback chains.

Task Decomposition

Don't ask an LLM to do everything at once. Break the job into discrete, verifiable steps:

User Request → Intent Classification → Parameter Extraction → Tool Selection → Tool Execution → Response Synthesis

Each step is a separate LLM call with a focused prompt. This is slower than one megaprompt, but it's debuggable, testable, and you can cache intermediate results.
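The pipeline above can be sketched as small typed step functions composed in sequence. This is a hedged sketch: the step bodies are stubs standing in for real focused-prompt LLM calls, and names like `classifyIntent` are illustrative.

```typescript
// Each pipeline stage is a small, independently testable async function.
// The LLM calls are stubbed here; in production each stub wraps one focused prompt.
type Intent = 'order_status' | 'refund' | 'other';

interface PipelineContext {
  userMessage: string;
  intent?: Intent;
  params?: Record<string, unknown>;
  tool?: string;
}

type Step = (ctx: PipelineContext) => Promise<PipelineContext>;

// Stub: a real version would be a schema-enforced LLM call.
const classifyIntent: Step = async (ctx) => ({
  ...ctx,
  intent: ctx.userMessage.includes('refund') ? 'refund' : 'order_status',
});

const extractParams: Step = async (ctx) => ({
  ...ctx,
  params: { orderId: ctx.userMessage.match(/#(\d+)/)?.[1] },
});

const selectTool: Step = async (ctx) => ({
  ...ctx,
  tool: ctx.intent === 'refund' ? 'create_ticket' : 'search_orders',
});

// Compose the steps; each intermediate result is a natural point to log or cache.
async function runPipeline(userMessage: string, steps: Step[]): Promise<PipelineContext> {
  let ctx: PipelineContext = { userMessage };
  for (const step of steps) {
    ctx = await step(ctx); // a cache lookup keyed on (step, ctx) could go here
  }
  return ctx;
}
```

Because every stage takes and returns the same context shape, you can unit-test stages in isolation and swap one out without touching the rest.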

Tool Calling with Schema Enforcement

Vercel AI SDK's generateObject is your friend here. Define exactly what you expect:

import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

const ToolCallSchema = z.object({
  tool: z.enum(['search_orders', 'create_ticket', 'escalate']),
  params: z.record(z.unknown()),
  confidence: z.number().min(0).max(1),
});

const { object: toolCall } = await generateObject({
  model: openai('gpt-4'),
  schema: ToolCallSchema,
  prompt: classifyIntent(userMessage),
});

If the model can't produce JSON that matches the schema, generateObject throws a typed error you can catch and handle (and it retries transient API failures for you). No more parsing broken tool calls at 3 AM.

Guardrails: Input and Output Validation

Every agent call needs two checkpoints:

Input validation: Is this request safe to process? Rate limit check. PII scan. Token count validation. Malformed input rejection.

Output validation: Did the model return what we asked for? Schema match. Sanity checks (e.g., a "refund amount" shouldn't exceed the order total). Hallucination detection for specific fields.
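The refund sanity check mentioned above can be an ordinary predicate that compares the model's numbers against ground truth. A sketch with hypothetical types (`RefundOutput` and `Order` are illustrative, not part of any library):

```typescript
// Hypothetical shapes for a refund tool call and its source order.
interface RefundOutput { orderId: string; refundAmount: number; }
interface Order { id: string; total: number; }

// Output sanity check: schema-valid JSON can still be semantically wrong,
// so verify the model's values against real data before acting on them.
function validateRefund(output: RefundOutput, order: Order): void {
  if (output.orderId !== order.id) {
    throw new Error(`Refund targets order ${output.orderId}, expected ${order.id}`);
  }
  if (output.refundAmount <= 0 || output.refundAmount > order.total) {
    throw new Error(`Refund ${output.refundAmount} outside valid range (0, ${order.total}]`);
  }
}
```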

I wrap every LLM call in a guardrail function:

async function withGuardrails<TIn, TOut>(
  input: TIn,
  call: (input: TIn) => Promise<TOut>,
  validateInput: (input: TIn) => void,
  validateOutput: (output: TOut) => void,
  fallback: () => TOut
): Promise<TOut> {
  try {
    validateInput(input);
    const result = await call(input);
    validateOutput(result);
    return result;
  } catch (error) {
    // Log to Langfuse, trigger alert, return fallback
    return fallback();
  }
}

Fallback Chains

When the primary model fails, have a Plan B:

  1. Retry with the same model (transient errors)
  2. Fallback to a cheaper/smaller model (cost control)
  3. Use cached results (if available and fresh enough)
  4. Return a graceful degradation response ("I can help with that, but I'll need more information")
  5. Escalate to human (critical failures)

The key is deciding your fallback strategy before you need it, not during an incident.
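The five-step chain above can be expressed as an ordered list of strategies tried in sequence; a strategy that doesn't apply (cache miss, model down) throws to pass control to the next one. A minimal sketch, with the strategy names left to the caller:

```typescript
// A fallback chain: try each strategy in order until one succeeds.
type Strategy<T> = () => Promise<T>;

async function withFallbacks<T>(strategies: Strategy<T>[]): Promise<T> {
  let lastError: unknown;
  for (const strategy of strategies) {
    try {
      return await strategy();
    } catch (error) {
      lastError = error; // record and fall through to the next strategy
    }
  }
  // Every strategy failed: surface the last error for alerting/escalation.
  throw lastError ?? new Error('No fallback strategies provided');
}
```

Usage mirrors the ordered list: `withFallbacks([retryPrimary, callCheaperModel, readCache, gracefulReply])`, where the last entry never throws. Declaring the chain as data makes the "decide before the incident" rule enforceable in code review.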

Concrete Patterns for Production

Here are the specific patterns I use in every production agent system:

Retry with Exponential Backoff

LLM APIs fail. Rate limits hit. Network blips. Handle it:

import { retry } from 'ts-retry-promise';

const result = await retry(
  () => generateObject({ /* ... */ }),
  {
    retries: 3,
    delay: 1000,
    backoff: 'EXPONENTIAL',
    retryIf: (error) => 
      error.status === 429 || 
      error.status >= 500 ||
      error.code === 'ETIMEDOUT'
  }
);

Don't retry 4xx errors (that's your bug), but do retry rate limits and server errors.

Cost Tracking Per Agent Call

Every single LLM call should log:

  • Input tokens
  • Output tokens
  • Total cost (calculated)
  • Model used
  • Call duration
  • Success/failure status

Langfuse handles this automatically, but you need to instrument it:

import { langfuse } from '@yabasha/cas/observability';

const trace = langfuse.trace({ name: 'order-assistant', userId: request.userId });
const generation = trace.generation({
  name: 'classify-intent',
  model: 'gpt-4',
  input: userMessage,
});

try {
  const result = await classifyIntent(userMessage);
  generation.end({ output: result, usage: result.usage });
} catch (error) {
  generation.end({ level: 'ERROR', statusMessage: error.message });
  throw error;
}

Now you can answer: "We spent $47 on LLM calls yesterday. $38 of that came from the order-assistant agent. $12 of that was from retry loops."
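The cost column itself is just per-token arithmetic over the logged usage. A sketch with illustrative prices (per-model rates change often; treat the table as a placeholder you'd keep in config):

```typescript
// Illustrative per-1M-token prices; real rates vary by model and change over time.
const PRICE_PER_MILLION: Record<string, { input: number; output: number }> = {
  'gpt-4o': { input: 2.5, output: 10 },
  'gpt-4o-mini': { input: 0.15, output: 0.6 },
};

interface CallUsage { model: string; inputTokens: number; outputTokens: number; }

// Dollar cost of a single logged call.
function callCost({ model, inputTokens, outputTokens }: CallUsage): number {
  const price = PRICE_PER_MILLION[model];
  if (!price) throw new Error(`No price configured for model: ${model}`);
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// Aggregate cost per agent from a batch of logged calls — the query behind
// "which agent spent what yesterday?".
function costByAgent(calls: Array<CallUsage & { agent: string }>): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const call of calls) {
    totals[call.agent] = (totals[call.agent] ?? 0) + callCost(call);
  }
  return totals;
}
```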

Response Caching

Many agent calls are deterministic given the same input. Cache them:

const cacheKey = hash({ prompt, model, temperature: 0 }); // deterministic only
const cached = await cache.get(cacheKey);
if (cached) return cached;

const result = await generateObject({ /* ... */ });
await cache.set(cacheKey, result, { ttl: '1h' });

Use temperature=0 for cacheable calls, temperature>0 for creative/generative tasks only.
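The `cache` and `hash` in the snippet above are placeholders. A minimal in-memory version looks like the following; in production you'd back the store with Redis/Upstash and hash the key with something like SHA-256 instead of raw JSON:

```typescript
// Minimal TTL cache keyed by a deterministic string. Single-process only;
// a real deployment would keep this state in Redis/Upstash.
interface Entry<T> { value: T; expiresAt: number; }

class TtlCache<T> {
  private store = new Map<string, Entry<T>>();

  // Injectable clock makes expiry testable without sleeping.
  constructor(private now: () => number = Date.now) {}

  get(key: string): T | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.store.delete(key); // lazily evict stale entries
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T, ttlMs: number): void {
    this.store.set(key, { value, expiresAt: this.now() + ttlMs });
  }
}

// Deterministic cache key: stringify with sorted keys so property order
// doesn't change the key. Note: the replacer list also filters nested keys,
// so this sketch is only safe for flat parameter objects.
function cacheKey(params: Record<string, unknown>): string {
  return JSON.stringify(params, Object.keys(params).sort());
}
```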

Observability: You Can't Fix What You Can't See

Production agents are distributed systems. Every call is a potential failure point. You need:

Langfuse Tracing

Langfuse gives you distributed traces across your agent execution. You can see:

  • The full execution graph of an agent run
  • Token usage and cost at every step
  • Latency breakdown (where is time actually spent?)
  • Input/output for debugging

// In your Convex action
export const runAgent = action({
  args: { conversationId: v.id('conversations') },
  handler: async (ctx, args) => {
    const trace = langfuse.trace({ 
      name: 'customer-support-agent',
      metadata: { conversationId: args.conversationId }
    });
    
    // Each step gets its own span
    const intentSpan = trace.span({ name: 'classify-intent' });
    const intent = await classifyIntent(ctx, args);
    intentSpan.end();
    
    const toolSpan = trace.span({ name: 'execute-tools' });
    const result = await executeTools(ctx, intent);
    toolSpan.end();
    
    trace.update({ status: 'success' });
    return result;
  },
});

Eval Loops

You need automated evaluation running continuously. Not just "did it work?" but "was it good?"

I run three types of evals:

  1. Deterministic checks: Did the output match the schema? Did it contain required fields?
  2. LLM-as-judge: Use a cheap model (like gpt-4o-mini) to grade outputs against criteria
  3. Human spot-checks: Sample 1% of production traffic for manual review

// In packages/evals/src/evaluators.ts
export const intentAccuracyEval = defineEval({
  name: 'intent-classification-accuracy',
  dataset: 'intent-classification-test-set',
  evaluator: async ({ input, expected }) => {
    const result = await classifyIntent(input);
    const { object: judge } = await generateObject({
      model: openai('gpt-4o-mini'),
      schema: z.object({ correct: z.boolean(), reason: z.string() }),
      prompt: `Did the model classify "${result.intent}" correctly? Expected: ${expected}`,
    });
    return { score: judge.correct ? 1 : 0, reason: judge.reason };
  },
});

Run these in CI on every push, and nightly against production samples.

Cost Attribution

Know which features cost what:

// Tag every trace with the feature that triggered it
langfuse.trace({
  name: 'support-agent',
  tags: ['feature:order-lookup', 'environment:production'],
});

Now you can query: "How much did the order-lookup feature cost in March?" This is essential for pricing decisions and capacity planning.

The Agent Loop Pattern

After trying various architectures, I settled on a consistent loop pattern:

spawn → execute → validate → retry or escalate

Spawn: Create a trace/span for this agent run. Load context (conversation history, user profile). Set guardrails.

Execute: Run the core agent logic (intent classification, tool calls, response generation). This is where the LLM calls happen.

Validate: Check the output against schema, business rules, and safety constraints. Is this safe to return to the user? Is it accurate?

Retry or Escalate:

  • If validation fails and we have retries left: fix the input (add clarification), retry
  • If validation fails and we're out of retries: return a graceful fallback response
  • If this is a critical failure: escalate to human and alert

This loop runs for every user request. It's not elegant, but it's reliable.
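Stripped to its shape, the loop looks like this. A hedged sketch: `execute` and `validate` stand in for the real LLM call and checks, and the spawn step (tracing, context loading) is elided:

```typescript
// The spawn → execute → validate → retry-or-escalate loop, reduced to its control flow.
interface LoopResult<T> { status: 'ok' | 'fallback' | 'escalated'; value?: T; }

async function agentLoop<T>(
  execute: (attempt: number) => Promise<T>,   // the LLM call(s); attempt lets you add clarification
  validate: (output: T) => boolean,           // schema + business rules + safety
  opts: { maxRetries: number; fallback: () => T; critical?: (e: unknown) => boolean }
): Promise<LoopResult<T>> {
  for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
    try {
      const output = await execute(attempt);                         // execute
      if (validate(output)) return { status: 'ok', value: output };  // validate
      // Validation failed with retries left: loop again.
    } catch (error) {
      if (opts.critical?.(error)) return { status: 'escalated' };    // alert a human
    }
  }
  // Out of retries: graceful fallback rather than an error page.
  return { status: 'fallback', value: opts.fallback() };
}
```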

Real Deployment Considerations

Timeouts

LLM calls are slow. Set aggressive timeouts:

// In Convex actions
export const agentAction = action({
  args: { message: v.string() },
  returns: v.string(),
  handler: async (ctx, args) => {
    // Convex has a 30s action timeout
    // We budget 25s for LLM calls, 5s for overhead
    const timeoutPromise = new Promise((_, reject) => 
      setTimeout(() => reject(new Error('AGENT_TIMEOUT')), 25000)
    );
    
    return Promise.race([
      runAgent(args.message),
      timeoutPromise
    ]);
  },
});

Have a fallback ready when the timeout hits. Don't leave users hanging.

Rate Limits

Every LLM provider has different rate limits. Track your usage:

// Simple token bucket in Redis/Upstash Redis
const rateLimiter = new TokenBucket({
  name: 'openai-requests',
  capacity: 100,  // requests per minute
  refillRate: 100/60,  // per second
});

if (!(await rateLimiter.take())) {
  // Queue for later or fallback to cached/cheaper model
  await queueForLater(request);
  return { status: 'queued', message: 'Processing in background' };
}

Graceful Degradation

When everything fails, what does the user see?

Have a dead-simple fallback agent:

import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

export const fallbackAgent = async (userMessage: string) => {
  // Uses no tools, just responds politely
  return generateText({
    model: openai('gpt-3.5-turbo'),  // cheap, fast
    prompt: `We're experiencing technical difficulties. Respond helpfully to: "${userMessage}" without making commitments.`,
  });
};

This keeps the conversation going while you fix the underlying issue.

What CAS Provides Out of the Box

The Composable AI Stack (CAS) encodes all these patterns so you don't have to rebuild them:

  • Task Decomposition: packages/agents with composable step functions
  • Tool Calling: Vercel AI SDK integration with Zod schemas
  • Guardrails: @yabasha/cas/guardrails with input/output validators
  • Retry Logic: built into the LLM client layer
  • Cost Tracking: Langfuse integration with per-call attribution
  • Observability: full tracing + eval harness in packages/evals
  • Agent Loop: AgentRunner class implementing spawn → execute → validate → retry
  • Timeouts: Convex action-level + client-level timeout handling
  • Fallback Chains: configurable fallback strategies per agent

Initialize a new project:

bun add -g @yabasha/cas
cas init my-agent-system
cd my-agent-system
bun install
cas dev

You get a working agent system with all the production patterns already wired up. The examples include a customer support agent, a data analysis agent, and a code review agent — each showing different patterns for decomposition, tool use, and error handling.

TL;DR: Production Agent Checklist

  • Decompose tasks into verifiable steps
  • Use Zod schemas for all LLM outputs
  • Wrap every call in guardrails (input + output validation)
  • Implement retry with exponential backoff
  • Track cost per call, per feature, per user
  • Use Langfuse for distributed tracing
  • Run automated evals on every change
  • Set aggressive timeouts with fallbacks
  • Handle rate limits explicitly
  • Plan graceful degradation for total failure
  • Document your fallback strategy before you need it

What's Next

Agent systems are still maturing. The tools are getting better (Vercel AI SDK, Langfuse, Convex), but the hard part remains architecture and observability. The teams that win won't be the ones with the cleverest prompts. They'll be the ones with the most reliable systems.

If you're building agents for production, start with the failure modes. Design your system to degrade gracefully. Instrument everything. And for the love of all that is holy, don't copy the 10-minute tutorial code into your production repo.

Questions? I'm @yabasha on Twitter/X, or reach out through yabasha.dev.

