A first-person overview of production AI engineering work spanning RAG systems, agent architectures, full-stack Laravel/Next.js delivery, and cost optimization — built for real traffic, not tutorials.

I have been building software for over twenty years. For the past several, that work has centered on production AI systems — the kind that handle real traffic, real data, and real failure modes. This post is a straight inventory of what I actually ship, how I architect it, and what breaks when you move beyond tutorials.
Retrieval-Augmented Generation is the backbone of most production AI products I build. But a RAG pipeline that works on a demo PDF is not the same as one that ingests ten thousand heterogeneous documents with mixed Arabic and English content, tables, headers, and inconsistent formatting.
The default "split every 1000 characters" approach fails the moment your source material has structure. I use recursive character splitting as a baseline, but the real work happens in semantic chunking — splitting at natural boundaries like section headers, paragraph breaks, and logical transitions. For structured documents, I preserve metadata (page numbers, section titles, source filenames) and attach it to every chunk so the retrieval layer can filter and rank with context.
For Arabic content, chunking gets harder. Arabic morphology is rich, sentence boundaries are not always obvious, and mixed Arabic-English documents create embedding drift if you are not careful. I typically chunk Arabic text slightly smaller than English equivalents to preserve semantic coherence across the embedding model's context window.
I have shipped systems on both Pinecone and pgvector. Pinecone wins when you need managed scaling and hybrid search out of the box. pgvector wins when your retrieval layer is already tightly coupled to PostgreSQL, when you need transactional consistency between your application data and your vectors, or when you are optimizing for cost at scale.
My default embedding stack is text-embedding-3-large for English and multilingual content, with fallback to text-embedding-3-small for high-volume, lower-precision use cases. For Arabic-specific deployments, I evaluate Arabic-specific embedding models against a held-out test set before committing — generic multilingual models are better than they used to be, but they are not always the right tool.
Cosine similarity on a single embedding is rarely enough. I run hybrid search — combining dense vector similarity with sparse BM25 keyword matching — and then re-rank with a cross-encoder or a lightweight reranking model. The dense retrieval gets you recall; the re-ranker gets you precision. Without the second step, you get plausible-sounding but irrelevant chunks at the top of your context window.
I also implement query expansion and hypothetical document embedding (HyDE) where appropriate: generate a synthetic answer to the user's question, embed that, and retrieve against it. It costs an extra LLM call, but the retrieval quality improvement is measurable.
Continue Reading
Agents are not magic. They are loops — a language model, a set of tools, and a control structure that decides what to call next. The hard part is making that loop reliable enough to expose to users.
I build agents with explicit tool schemas, not free-form prompting. Every tool has a typed input schema, a clear description, and a deterministic output. The LLM generates structured JSON to invoke tools, and I validate that JSON against the schema before execution. If the model hallucinates a parameter or sends malformed JSON, the validation layer catches it and returns a structured error back to the agent loop.
Tools fall into three categories in my systems: data retrieval (search, database queries), action execution (send email, create ticket, update record), and computation (math, formatting, validation). I never give an agent unrestricted write access without human confirmation — the blast radius is too large.
Every agent has input guardrails: PII detection, prompt injection filtering, and topic boundary checks. Every agent has output guardrails: format validation, safety checks, and confidence thresholds. If the model's output confidence is below a threshold, or if validation fails, the system falls back to a simpler chain or escalates to a human.
I implement fallback chains as explicit state machines, not implicit prompt engineering. If the primary model fails, the system tries a smaller model with a narrower prompt. If that fails, it returns a graceful degradation message and logs the incident for review.
You cannot debug what you cannot see. I instrument every agent trace with Langfuse: input prompts, tool calls, latencies, token counts, and error states. This is not optional in production. When a user reports a bad result, I need to reconstruct the exact chain of thought, see which tool was called with what parameters, and identify where the logic diverged from the intended path.
Langfuse also gives me cost attribution per trace, which matters when you are running multiple models and need to know which user workflows are expensive.
AI does not exist in a vacuum. It sits inside applications that have authentication, billing, permissions, mobile clients, and deployment pipelines. I build the full stack, not just the model layer.
My default backend is Laravel 12. It handles queues, scheduling, database migrations, API authentication, and event broadcasting without ceremony. For the frontend, I use Next.js 16 with the App Router, server components for initial data fetching, and client components for interactive AI chat interfaces.
The monorepo structure keeps the API contracts tight. Laravel exposes typed API resources; Next.js consumes them with generated TypeScript types. When the backend schema changes, the frontend build breaks immediately — which is exactly what I want.
When the product needs a mobile presence, I ship React Native apps that share business logic with the web frontend. The AI chat interface, streaming responses, and tool-call UIs are implemented once and adapted to mobile constraints. Push notifications for async agent completions are handled through Laravel's notification channels.
Tutorials show you the happy path. Production is the other 90% of the work.
LLM costs scale linearly with traffic and super-linearly with context window size. A naive RAG system that dumps 8,000 tokens of retrieved context into every request will bankrupt you at scale. I implement aggressive context compression, selective retrieval (only fetch what you need), and model routing: use the cheapest model that can handle the task, and escalate to expensive models only when necessary.
Users will not wait five seconds for a chat response. I optimize latency at every layer: faster embedding models, cached retrieval results, streaming responses to the client, and pre-computed summaries for common queries. For high-latency operations — like multi-step agent workflows — I switch to an async pattern: acknowledge the request, process in the background, and notify the user when complete.
Hallucinations do not go away with better prompting. You need grounding: every generated claim must be traceable to a retrieved source. I implement citation requirements in the prompt, parse citations from the model output, and verify that cited chunks actually exist in the retrieval set. If the model cannot cite a source, the answer is flagged for review.
Here is a simplified view of a production RAG + Agent system I have shipped:
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Next.js 16 │────▶│ Laravel 12 │────▶│ Vector Store │
│ (Frontend) │ │ (API / Queue) │ │ (pgvector) │
└─────────────────┘ └──────────────────┘ └─────────────────┘
│
▼
┌──────────────────┐
│ AI Agent Loop │
│ (Tool Calling) │
└──────────────────┘
│
┌────────────────┼────────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Search │ │ Action │ │ LLM API │
│ (RAG) │ │ Tools │ │ (Router) │
└──────────┘ └──────────┘ └──────────┘
The LLM API layer routes between models based on task complexity: small model for classification and routing, large model for generation and reasoning. Every call is traced, every cost is logged, every failure triggers a fallback.
Model routing is the single biggest lever. I classify incoming requests by complexity and route them to the appropriate model tier. Simple queries hit a fast, cheap model. Complex reasoning escalates to a larger model. This alone can reduce costs by 60-70% without measurable quality loss.
Caching is the second lever. Embedding results, retrieval sets, and common LLM responses are cached with TTLs appropriate to the data freshness requirements. For batch workloads — like nightly document ingestion — I use batching APIs where available to cut per-request overhead.
Building AI for the Middle East and North Africa introduces constraints that Western-centric tutorials ignore. Arabic NLP requires handling diacritics, dialectal variation, and right-to-left UI flows. Compliance requirements vary by country: data residency, content moderation standards, and regulatory approval for AI-generated outputs.
I design systems with regional modularity from the start: embedding models that handle Arabic well, UI components that support RTL layouts, and deployment architectures that can pin data to specific geographic regions when required. Scaling in MENA also means optimizing for mobile-first usage and variable network conditions — lightweight APIs, aggressive caching, and offline-capable mobile features.
Shipping AI features without evaluation is reckless. I run automated quality checks on every deployment: retrieval accuracy (did we fetch the right chunks?), answer relevance (does the generated response address the question?), citation correctness (are the sources real and relevant?), and safety (no PII leaks, no policy violations).
These checks run against a held-out evaluation dataset that grows over time. When a new failure mode appears in production, I add an example to the eval set and fix the system. The eval suite runs in CI before every deploy. If scores regress, the deploy is blocked.
If you are building AI products, the gap between prototype and production is wider than it looks. I bridge that gap — architecting systems that retrieve accurately, reason reliably, scale affordably, and fail gracefully. Whether you need a RAG pipeline for internal documents, an agent that interacts with your existing APIs, or a full-stack product with mobile and web clients, the patterns above are how I ship.
If you want to talk architecture, model selection, or regional deployment strategy, get in touch.

AI Engineer & Full-Stack Tech Lead
Expertise: 20+ years full-stack development. Specializing in architecting cognitive systems, RAG architectures, and scalable web platforms for the MENA region.
Practical AI + full-stack insights for MENA builders. No spam.



