Move beyond fragile AI demos with this guide to building production-ready agents. Master task decomposition, guardrails, and observability for reliable systems.

The cobbler's children have no shoes.
I've watched this play out dozens of times: a team ships a slick AI demo, investors nod approvingly, then the first real user hits an edge case and the whole thing collapses. Hallucinations multiply. Context windows explode. API costs spike 400% overnight. The demo looked brilliant; the production system is a liability.
This isn't a failure of LLM technology. It's a failure of architecture.
If you're building AI agents that need to survive contact with real users, real data, and real error conditions, you need more than a clever prompt. You need systems thinking. This is what I've learned from shipping agents that actually stay up.
Walk through any "Build an AI Agent in 10 Minutes" tutorial and you'll see the same pattern:
```typescript
const response = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: userInput },
  ],
  tools: [someTool],
});
```

That's not an agent. That's a function call with extra steps.
What breaks in production? Everything the demo never exercised: hallucinations, context overflow, runaway API costs, and unhandled errors.
Production agents aren't about the happy path. They're about gracefully handling every possible failure mode while staying within budget.
After several iterations of broken agents, I landed on a pattern that holds up: task decomposition + tool calling + guardrails + fallback chains.
Don't ask an LLM to do everything at once. Break the job into discrete, verifiable steps:
User Request → Intent Classification → Parameter Extraction → Tool Selection → Tool Execution → Response Synthesis
Each step is a separate LLM call with a focused prompt. This is slower than one megaprompt, but it's debuggable, testable, and you can cache intermediate results.
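A minimal sketch of that pipeline as composed, individually testable steps. The stage implementations below are hypothetical stubs standing in for focused LLM calls; only the composition pattern is the point:

```typescript
// Each stage is a small, focused async function that can be unit-tested
// and cached independently. Real implementations would each wrap a
// single LLM call with its own prompt.
type Step<In, Out> = (input: In) => Promise<Out>;

// Compose two steps into one; errors surface at the failing stage.
const pipe = <A, B, C>(f: Step<A, B>, g: Step<B, C>): Step<A, C> =>
  async (a) => g(await f(a));

// Stub stages (illustrative only — not real classifiers).
const classifyIntent: Step<string, { intent: string; text: string }> =
  async (text) => ({
    intent: text.includes('refund') ? 'refund' : 'other',
    text,
  });

const extractParams: Step<
  { intent: string; text: string },
  { intent: string; params: Record<string, string> }
> = async ({ intent, text }) => ({ intent, params: { raw: text } });

const handleRequest = pipe(classifyIntent, extractParams);
```

Because each stage has its own input and output types, a failure points at a specific step instead of one opaque megaprompt.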
Vercel AI SDK's generateObject is your friend here. Define exactly what you expect:
```typescript
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

const ToolCallSchema = z.object({
  tool: z.enum(['search_orders', 'create_ticket', 'escalate']),
  params: z.record(z.unknown()),
  confidence: z.number().min(0).max(1),
});

const { object: toolCall } = await generateObject({
  model: openai('gpt-4'),
  schema: ToolCallSchema,
  prompt: classifyIntent(userMessage),
});
```

If the model can't produce valid JSON within the schema, the generateObject layer retries or fails gracefully. No more parsing broken tool calls at 3 AM.
Every agent call needs two checkpoints:
- **Input validation:** Is this request safe to process? Rate-limit check. PII scan. Token-count validation. Malformed-input rejection.
- **Output validation:** Did the model return what we asked for? Schema match. Sanity checks (e.g., a "refund amount" shouldn't exceed the order total). Hallucination detection for specific fields.
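As a concrete sketch of an output sanity check — the response shape and field names here are hypothetical, but the business-rule checks mirror the refund example above:

```typescript
// Hypothetical refund-response shape; field names are illustrative.
interface RefundResponse {
  orderId: string;
  refundAmount: number;
}

// Business-rule validation beyond schema matching: a refund must be a
// positive finite number and must not exceed the order total.
function validateRefund(output: RefundResponse, orderTotal: number): void {
  if (!Number.isFinite(output.refundAmount) || output.refundAmount <= 0) {
    throw new Error(`Invalid refund amount: ${output.refundAmount}`);
  }
  if (output.refundAmount > orderTotal) {
    throw new Error(
      `Refund ${output.refundAmount} exceeds order total ${orderTotal}`
    );
  }
}
```

Schema validation tells you the model returned a number; only a check like this tells you the number makes business sense.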
I wrap every LLM call in a guardrail function:
```typescript
async function withGuardrails<T>(
  input: unknown,
  call: () => Promise<T>,
  validateInput: (input: unknown) => void,
  validateOutput: (output: T) => void,
  fallback: () => T
): Promise<T> {
  try {
    validateInput(input);
    const result = await call();
    validateOutput(result);
    return result;
  } catch (error) {
    // Log to Langfuse, trigger alert, return fallback
    return fallback();
  }
}
```

When the primary model fails, have a Plan B.
The key is deciding your fallback strategy before you need it, not during an incident.
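A fallback chain can be sketched as an ordered list of strategies, tried until one succeeds. This is a simplified sketch, not the CAS implementation; the strategy names are illustrative:

```typescript
// Try each strategy in order, from preferred to last-resort,
// and return the first success. If every strategy fails, the
// combined error trail tells you exactly what was attempted.
type Strategy<T> = { name: string; run: () => Promise<T> };

async function withFallbackChain<T>(strategies: Strategy<T>[]): Promise<T> {
  const errors: string[] = [];
  for (const s of strategies) {
    try {
      return await s.run();
    } catch (err) {
      errors.push(`${s.name}: ${(err as Error).message}`);
    }
  }
  throw new Error(`All fallbacks exhausted: ${errors.join('; ')}`);
}
```

A typical chain: primary model → cheaper model → cached or canned response. Declaring it as data makes the strategy reviewable before an incident, not improvised during one.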
Here are the specific patterns I use in every production agent system:
LLM APIs fail. Rate limits hit. Network blips. Handle it:
```typescript
import { retry } from 'ts-retry-promise';

const result = await retry(
  () => generateObject({ /* ... */ }),
  {
    retries: 3,
    delay: 1000,
    backoff: 'EXPONENTIAL',
    retryIf: (error) =>
      error.status === 429 ||
      error.status >= 500 ||
      error.code === 'ETIMEDOUT',
  }
);
```

Don't retry 4xx errors (that's your bug), but do retry rate limits and server errors.
Every single LLM call should log its inputs and outputs, token usage, latency, cost, and any errors.
Langfuse handles this automatically, but you need to instrument it:
```typescript
import { langfuse } from '@yabasha/cas/observability';

const trace = langfuse.trace({ name: 'order-assistant', userId: request.userId });

const generation = trace.generation({
  name: 'classify-intent',
  model: 'gpt-4',
  input: userMessage,
});

try {
  const result = await classifyIntent(userMessage);
  generation.end({ output: result, usage: result.usage });
} catch (error) {
  generation.end({ level: 'ERROR', statusMessage: (error as Error).message });
  throw error;
}
```

Now you can answer: "We spent $47 on LLM calls yesterday. $38 of that came from the order-assistant agent. $12 of that was from retry loops."
Many agent calls are deterministic given the same input. Cache them:
```typescript
const cacheKey = hash({ prompt, model, temperature: 0 }); // deterministic only
const cached = await cache.get(cacheKey);
if (cached) return cached;

const result = await generateObject({ /* ... */ });
await cache.set(cacheKey, result, { ttl: '1h' });
```

Use temperature=0 for cacheable calls, and temperature>0 for creative/generative tasks only.
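The `hash` call above has to be stable across object key ordering, or `{ prompt, model }` and `{ model, prompt }` produce different cache keys. A minimal sketch — the key-sorting serializer and FNV-1a hash are illustrative choices, not any particular library's implementation:

```typescript
// Serialize with sorted object keys so logically equal inputs
// always produce the same string.
function stableStringify(value: unknown): string {
  if (value === null || typeof value !== 'object') return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(',')}]`;
  const entries = Object.entries(value as Record<string, unknown>)
    .sort(([a], [b]) => a.localeCompare(b))
    .map(([k, v]) => `${JSON.stringify(k)}:${stableStringify(v)}`);
  return `{${entries.join(',')}}`;
}

// 32-bit FNV-1a over the stable serialization.
function hashKey(input: object): string {
  const s = stableStringify(input);
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}
```

In production you would likely use a cryptographic hash from your runtime instead, but the key-ordering concern is the same either way.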
Production agents are distributed systems. Every call is a potential failure point. You need tracing, continuous evals, and per-feature cost attribution.
Langfuse gives you distributed traces across your agent execution, so you can see every step of a run:
```typescript
// In your Convex action
export const runAgent = action({
  args: { conversationId: v.id('conversations') },
  handler: async (ctx, args) => {
    const trace = langfuse.trace({
      name: 'customer-support-agent',
      metadata: { conversationId: args.conversationId },
    });

    // Each step gets its own span
    const intentSpan = trace.span({ name: 'classify-intent' });
    const intent = await classifyIntent(ctx, args);
    intentSpan.end();

    const toolSpan = trace.span({ name: 'execute-tools' });
    const result = await executeTools(ctx, intent);
    toolSpan.end();

    trace.update({ status: 'success' });
    return result;
  },
});
```

You need automated evaluation running continuously. Not just "did it work?" but "was it good?"
I run three types of evals:
```typescript
// In packages/evals/src/evaluators.ts
export const intentAccuracyEval = defineEval({
  name: 'intent-classification-accuracy',
  dataset: 'intent-classification-test-set',
  evaluator: async ({ input, expected }) => {
    const result = await classifyIntent(input);
    const { object: judge } = await generateObject({
      model: openai('gpt-4o-mini'),
      schema: z.object({ correct: z.boolean(), reason: z.string() }),
      prompt: `Did the model classify "${result.intent}" correctly? Expected: ${expected}`,
    });
    return { score: judge.correct ? 1 : 0, reason: judge.reason };
  },
});
```

Run these in CI on every push, and nightly against production samples.
Know which features cost what:
```typescript
// Tag every trace with the feature that triggered it
langfuse.trace({
  name: 'support-agent',
  tags: ['feature:order-lookup', 'environment:production'],
});
```

Now you can query: "How much did the order-lookup feature cost in March?" This is essential for pricing decisions and capacity planning.
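Once traces are tagged, the aggregation itself is simple. Here's a sketch over already-fetched trace records; the record shape is hypothetical, and in practice you'd map results from Langfuse's API into this form:

```typescript
// Hypothetical trace record shape after fetching from your
// observability store; only tags and per-trace cost are needed here.
interface TraceRecord {
  tags: string[];
  costUsd: number;
}

// Sum cost per feature tag, e.g. 'feature:order-lookup'.
function costByFeature(traces: TraceRecord[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const t of traces) {
    for (const tag of t.tags) {
      if (!tag.startsWith('feature:')) continue;
      totals.set(tag, (totals.get(tag) ?? 0) + t.costUsd);
    }
  }
  return totals;
}
```

Filter the fetched traces by month before aggregating and you have the "order-lookup in March" number.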
After trying various architectures, I settled on a consistent loop pattern:
spawn → execute → validate → retry or escalate
- **Spawn:** Create a trace/span for this agent run. Load context (conversation history, user profile). Set guardrails.
- **Execute:** Run the core agent logic (intent classification, tool calls, response generation). This is where the LLM calls happen.
- **Validate:** Check the output against schema, business rules, and safety constraints. Is this safe to return to the user? Is it accurate?
- **Retry or Escalate:** If validation fails, retry; once the retry budget is spent, fall back or escalate to a human.
This loop runs for every user request. It's not elegant, but it's reliable.
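The loop can be sketched as a small runner. This is a simplified illustration, not the CAS AgentRunner API; the hook signatures are assumptions:

```typescript
// spawn → execute → validate → retry or escalate, as a minimal loop.
interface AgentHooks<T> {
  execute: () => Promise<T>;          // core agent logic (LLM calls)
  validate: (output: T) => boolean;   // schema + business rules + safety
  escalate: () => Promise<T>;         // human handoff or fallback agent
}

async function runAgentLoop<T>(
  hooks: AgentHooks<T>,
  maxAttempts = 3
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const output = await hooks.execute();
      if (hooks.validate(output)) return output; // safe to return
    } catch {
      // execution failed; fall through to the next attempt
    }
  }
  return hooks.escalate(); // out of retries: escalate instead of failing
}
```

In a real system `execute` would open its own trace span and `escalate` would be the fallback agent described below.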
LLM calls are slow. Set aggressive timeouts:
```typescript
// In Convex actions
export const agentAction = action({
  args: { message: v.string() },
  returns: v.string(),
  handler: async (ctx, args) => {
    // Convex has a 30s action timeout
    // We budget 25s for LLM calls, 5s for overhead
    const timeoutPromise = new Promise<never>((_, reject) =>
      setTimeout(() => reject(new Error('AGENT_TIMEOUT')), 25_000)
    );
    return Promise.race([
      runAgent(args.message),
      timeoutPromise,
    ]);
  },
});
```

Have a fallback ready when the timeout hits. Don't leave users hanging.
Every LLM provider has different rate limits. Track your usage:
```typescript
// Simple token bucket in Redis/Upstash Redis
const rateLimiter = new TokenBucket({
  name: 'openai-requests',
  capacity: 100,        // requests per minute
  refillRate: 100 / 60, // per second
});

if (!(await rateLimiter.take())) {
  // Queue for later or fall back to cached/cheaper model
  await queueForLater(request);
  return { status: 'queued', message: 'Processing in background' };
}
```

When everything fails, what does the user see?
Have a dead-simple fallback agent:
```typescript
import { generateText } from 'ai';

export const fallbackAgent = async (userMessage: string) => {
  // Uses no tools, just responds politely
  return generateText({
    model: openai('gpt-3.5-turbo'), // cheap, fast
    prompt: `We're experiencing technical difficulties. Respond helpfully to: "${userMessage}" without making commitments.`,
  });
};
```

This keeps the conversation going while you fix the underlying issue.
The Composable AI Stack (CAS) encodes all these patterns so you don't have to rebuild them:
| Pattern | CAS Implementation |
|---|---|
| Task Decomposition | packages/agents with composable step functions |
| Tool Calling | Vercel AI SDK integration with Zod schemas |
| Guardrails | @yabasha/cas/guardrails with input/output validators |
| Retry Logic | Built into the LLM client layer |
| Cost Tracking | Langfuse integration with per-call attribution |
| Observability | Full tracing + eval harness in packages/evals |
| Agent Loop | AgentRunner class implementing spawn→execute→validate→retry |
| Timeouts | Convex action-level + client-level timeout handling |
| Fallback Chains | Configurable fallback strategies per agent |
Initialize a new project:
```sh
bun add -g @yabasha/cas
cas init my-agent-system
cd my-agent-system
bun install
cas dev
```

You get a working agent system with all the production patterns already wired up. The examples include a customer support agent, a data analysis agent, and a code review agent — each showing different patterns for decomposition, tool use, and error handling.
Agent systems are still maturing. The tools are getting better (Vercel AI SDK, Langfuse, Convex), but the hard part remains architecture and observability. The teams that win won't be the ones with the cleverest prompts. They'll be the ones with the most reliable systems.
If you're building agents for production, start with the failure modes. Design your system to degrade gracefully. Instrument everything. And for the love of all that is holy, don't copy the 10-minute tutorial code into your production repo.
Questions? I'm @yabasha on Twitter/X, or reach out through yabasha.dev.
<!-- meta_title: "Building Production AI Agents That Actually Work | Yabasha" meta_description: "Stop copying toy demos. Learn the architecture, patterns, and observability practices that make AI agents reliable in production. Includes code examples and the Composable AI Stack." primary_keyword: "production ai agents" secondary_keywords: ["ai agent architecture", "llm production patterns", "vercel ai sdk", "langfuse tracing", "ai agent guardrails", "typescript ai agents", "convex backend"] ai_summary_short: "A practical guide to building reliable AI agents for production, covering architecture patterns (task decomposition, guardrails, fallback chains), observability with Langfuse, cost tracking, and the Composable AI Stack." ai_level: "intermediate" ai_intent: "learn production patterns for ai agents" -->
AI Engineer & Full-Stack Tech Lead
Expertise: 20+ years full-stack development. Specializing in architecting cognitive systems, RAG architectures, and scalable web platforms for the MENA region.



