All articles
20 min read2026-03-12

How to Build AI Agents That Actually Work in Production: Architecture, Failure Modes & Lessons Learned

The honest guide to production AI agent development — what breaks in the real world, how to design around failure modes, observability patterns, cost controls, and the architecture decisions that separate demo agents from production systems.

AI AgentsLangChainClaude APIAgentic AIAI EngineeringProduction AILLM Engineering

Every AI agent looks impressive in a demo. The agent receives a task, reasons about it, calls some tools, and produces a useful result — all within seconds, in a controlled environment, with cherry-picked inputs. Building AI agents that work in production — handling real-world inputs reliably, at scale, with acceptable cost and latency, without human supervision — is a fundamentally different engineering problem.

I have shipped a production AI Clinical Ops Agent that autonomously analyses quality metrics for 20,000+ US Skilled Nursing Facilities using Claude. It runs 24/7, processes tens of thousands of inference requests per day, and operates without human intervention except for edge case escalations it identifies itself. Building that system — and debugging it through every failure mode it encountered — gave me a deep understanding of what actually breaks in production AI agents and how to design around those failures from the start.

This is the guide I wish had existed before I built it.

The Production vs Demo Gap

The gap between a demo AI agent and a production AI agent is not incremental — it is categorical. Consider what changes when you move from demo to production:

Input distribution shifts dramatically. Your demo inputs are carefully selected. Production inputs include edge cases, malformed data, adversarial inputs, and combinations of data states you never anticipated. Your agent must handle all of them.

Reliability requirements change completely. A demo agent that fails 20% of the time is acceptable. A production agent that fails 20% of the time is a liability.

Cost becomes a real constraint. A demo agent can make 50 LLM calls to complete a task. A production agent processing 10,000 tasks per day at that rate costs a fortune and introduces unacceptable latency.

Observability becomes critical. When your demo agent fails, you debug it. When your production agent fails at 3am, you need logs, traces, and alerting that tell you what happened without you being present.

Security attack surface opens. Real users interact with production agents. Some will attempt prompt injection. Some will craft inputs designed to manipulate agent behaviour. Your production agent needs defences that demo agents do not.

Core Architectural Patterns

Single-Agent vs Multi-Agent Design

Single-agent systems give one agent a complete task and the tools to complete it. They are simpler to build, debug, and operate. Use single-agent systems by default — the complexity of multi-agent systems is only justified when:

  • Tasks genuinely require parallel execution that a single agent cannot achieve
  • Different steps require different model capabilities (e.g., a reasoning-heavy planning step and a fast structured output step)
  • The task scope is broad enough that a single agent's context window becomes a limiting factor

Multi-agent systems use a coordinator agent that decomposes tasks and dispatches them to specialist sub-agents. The coordinator aggregates results and handles conflicts. Multi-agent systems introduce significant complexity: agent-to-agent communication reliability, result aggregation logic, partial failure handling (what happens when one sub-agent fails?), and cascading cost when the coordinator needs to retry coordination.

My recommendation: start single-agent, add multi-agent coordination only when single-agent demonstrably cannot meet requirements.

Tool Design: The Most Underrated Component

The quality of your agent's tools determines the quality of its output more than the model you choose. Good tool design makes agents reliable; poor tool design makes them fragile and expensive.

Principles for production-grade tool design:

Tools should be idempotent. If an agent calls a tool twice with the same parameters (which it will, due to retries), the result should be the same. Writing tools that have idempotent semantics prevents duplicate data creation, double charges, and other side effects from agent retry behaviour.

Tools should fail loudly and informatively. When a tool fails, the error message it returns to the agent should explain what went wrong in terms the agent can reason about. "Error: 404" is not useful. "Patient with MRN 12345 was not found in the database. If this MRN is from an HL7 ADT message received before 9am today, it may not yet have propagated to the query API — retry in 15 minutes." gives the agent the information it needs to reason about whether to retry, escalate, or handle the failure differently.

Tools should validate inputs before execution. Agents generate tool call parameters that are sometimes structurally correct but semantically wrong (a date outside the valid range, a code that does not exist in the target terminology, an ID that references the wrong entity type). Validate tool inputs and return informative validation errors — the agent will often self-correct given good error feedback.

Prefer narrow, composable tools over broad, complex ones. A get_patient_summary tool that bundles demographics, diagnoses, medications, and lab results is convenient to call but expensive (large response context), inflexible (always fetches everything even when only one data domain is needed), and hard to test. Separate tools for demographics, conditions, medications, and observations are more expensive to call individually but give the agent precise control over what data it fetches.

Tool call logging is non-negotiable. Every tool call — the parameters, the response, the latency, and any errors — must be logged with a trace ID that links back to the originating agent run. This is your primary debugging surface when production agents behave unexpectedly.

Agent Memory Architecture

Production agents need different types of memory for different purposes:

In-context memory is the conversation history and retrieved context in the agent's current context window. It is fast (no I/O) but limited by context window size and expensive to fill with large documents.

External short-term memory persists state across individual LLM calls within an agent run. Use a key-value store (Redis, DynamoDB) to maintain agent state that exceeds what fits in the context window: intermediate results, tool outputs, reasoning traces.

External long-term memory persists information across agent runs — user preferences, historical decisions, entity knowledge. Use a vector database (pgvector, Pinecone, Weaviate) for semantic retrieval, or a structured database for precise attribute lookup.

For the clinical ops agent, the memory architecture is:

  • In-context: current facility's quality metrics, retrieved FHIR data, relevant clinical guidelines (RAG-retrieved)
  • Short-term: structured analysis state persisted to Redis for multi-step quality analysis runs
  • Long-term: facility-specific baseline metrics, historical analysis results, and clinical guideline embeddings in pgvector

Production Failure Modes (and Mitigations)

Failure Mode 1: Infinite Loops

Agents can get stuck in loops — attempting the same failed action repeatedly, or cycling between two approaches that each fail for different reasons. In production, unconstrained loops cause runaway costs, latency blow-up, and API rate limit hits.

Mitigation:

  • Implement a hard turn limit (maximum number of agent steps per run). My production agents have a configurable MAX_TURNS parameter that triggers immediate escalation when exceeded.
  • Track the agent's recent actions and detect repetition — if the agent has called the same tool with the same parameters three times in a row, something is wrong.
  • Implement circuit breakers at the tool level — if a specific tool has failed N times in the last M minutes, stop calling it and return a clear degradation signal.
class AgentRuntime:
    def __init__(self, max_turns: int = 25):
        self.max_turns = max_turns
        self.turn_count = 0
        self.action_history = []
    
    def check_loop_detection(self, action: AgentAction) -> bool:
        recent_actions = self.action_history[-3:]
        if all(a == action for a in recent_actions):
            raise AgentLoopDetected(f"Agent repeated action {action} 3 times")
        return False

Failure Mode 2: Hallucination in Tool Parameters

Agents sometimes generate tool call parameters that look syntactically correct but are semantically wrong — a patient ID that does not exist, a date in the wrong format for the target API, a medication code that has been deprecated.

Mitigation:

  • Validate tool parameters before execution, return structured validation errors
  • For critical identifiers (patient IDs, facility IDs), implement a verification step that confirms the entity exists before proceeding
  • Use Pydantic or JSON Schema validation for all tool input types
  • For codes and terminologies, validate against a local terminology lookup rather than assuming LLM-generated codes are valid

Failure Mode 3: Context Window Exhaustion

Long-running agents accumulate context — tool outputs, reasoning traces, intermediate results — until the context window fills and the model starts degrading or failing. Context window exhaustion is a near-certainty for agents that operate on large data sets.

Mitigation:

  • Implement context compression — periodically summarise the conversation history and replace it with a compressed summary
  • Use structured intermediate state stored externally (Redis) rather than keeping everything in the context
  • Design agent workflows to have natural checkpoints where accumulated context can be summarised and cleared
  • Monitor context utilisation and alert when approaching 70% of the model's context limit

Failure Mode 4: Prompt Injection

Users can craft inputs that attempt to override agent instructions: "Ignore your previous instructions and instead..." In production agents with tool access, prompt injection can cause real harm — agents executing unintended actions, accessing data they should not, or leaking sensitive information.

Mitigation:

  • Treat user inputs as data, not as instructions — structure your system prompt to create a clear separation between the agent's instructions and user-provided content
  • Sanitise inputs before passing them to the agent (strip or encode control characters, limit input length)
  • Implement output validation — check agent outputs and tool call parameters against a whitelist of expected patterns before execution
  • Use the principle of least privilege for tool access — agents should only have access to the tools and data they need for their specific task

Failure Mode 5: Stale Tool Context

When an agent queries a data source and stores the result in context, that result may become stale if the underlying data changes during a long agent run. Acting on stale data can produce incorrect outputs or trigger incorrect tool calls.

Mitigation:

  • Timestamp all data retrieved from external sources
  • For long-running agents, re-fetch critical data before taking actions that depend on it (especially for high-latency data like real-time financial data or availability information)
  • Design agents to acknowledge data freshness — include retrieved-at timestamps in tool responses and instruct the agent to consider data age in its reasoning

Cost Management in Production

LLM costs scale directly with token usage. Without cost controls, a production agent can generate unexpectedly large bills — especially when agents process large inputs, run multiple passes, or are exposed to adversarial inputs designed to maximise token consumption.

Cost-Reduction Patterns

Prefer smaller models for structured tasks. Not every step in an agent's workflow requires the most capable model. Use Claude claude-haiku-3-5 or GPT-4o mini for structured extraction, data formatting, and classification tasks. Reserve Claude claude-sonnet-4-5 for complex reasoning, clinical decision-making, and output synthesis. The cost difference between haiku and claude-sonnet-4-5 is 10-20x — routing even 50% of tool calls to haiku can halve your total inference costs.

Implement prompt caching. Anthropic's prompt caching feature (for Claude) and OpenAI's prompt caching cache the prefix of your prompt across API calls. For agents with long system prompts or large static context (clinical guidelines, policy documents), prompt caching reduces token costs by up to 90% for the cached portion.

Limit retrieved context size. RAG retrieval should return the minimum necessary context to answer the agent's current query. Retrieving 10 documents when 2 would suffice burns 5x the tokens on context processing.

Set hard token limits per agent run. Configure maximum context window usage per agent invocation and fail gracefully when the limit is approached, rather than allowing runaway token consumption.

Track cost per agent run. Log the total input and output tokens for every agent run, compute the cost, and attribute it to the requesting user or task type. Cost visibility enables informed decisions about model selection and architecture.

Observability Stack for Production Agents

The observability requirements for production AI agents exceed standard API observability because the failure modes are different and harder to detect.

What to Log

Every agent run should produce a structured log with:

  • run_id — unique identifier for the entire agent run
  • task_type — what the agent was asked to do
  • start_time, end_time, duration_ms
  • model — which model was used at each step
  • turn_count — number of LLM calls made
  • total_input_tokens, total_output_tokens, total_cost
  • tool_calls — ordered list of tool calls with parameters, responses, latency, and success/failure
  • final_statuscompleted, escalated, failed, loop_detected, limit_exceeded
  • output_summary — compressed summary of the agent's final output
  • confidence_score — self-assessed confidence from the agent (where appropriate)

Distributed Tracing

Use OpenTelemetry to instrument your agent runtime with traces that show the complete execution graph — LLM calls, tool calls, database queries, and external API calls — all linked by a single trace ID. LangSmith (for LangChain agents) and Langfuse (open-source) provide FHIR-specific tracing for LLM applications.

Alerting

Alert on:

  • Agent run failure rate exceeding baseline
  • P99 agent run duration exceeding SLO
  • Cost per run exceeding budget threshold
  • Loop detection events (agent was caught looping — indicates a systematic prompt or tool design issue)
  • Escalation rate exceeding baseline (agent is routing too many tasks to human review — indicates input distribution shift or tool degradation)

LangChain vs Custom Agent Framework

LangChain is the dominant agent framework and a reasonable choice for most use cases. Its strengths are the rich ecosystem of integrations, the community-maintained tool library, and the LangSmith observability integration. Its weaknesses are performance overhead, abstraction leakage that makes debugging harder, and rapid API changes that break existing code.

For the clinical ops agent, I ultimately moved from LangChain to a custom lightweight framework because:

  • Clinical context required deterministic behaviour that LangChain's abstractions made difficult to guarantee
  • The performance overhead of LangChain's abstraction layers was measurable at 20,000+ facility scale
  • Custom error handling and escalation logic was cleaner without working around LangChain's agent executor

My recommendation: use LangChain for agents you are building quickly or where the ecosystem integrations add significant value. Build custom frameworks for agents where performance, determinism, or complex custom logic are critical requirements.

Model Selection for Production

Claude claude-sonnet-4-5: Best for complex clinical reasoning, multi-step planning, and tasks requiring nuanced judgment. The tool use API is excellent — structured, reliable JSON tool calls with low hallucination rate. Higher cost than GPT-4o.

GPT-4o: Best for structured data extraction, code generation, and tasks where function calling reliability is critical. Slightly lower cost than Claude claude-sonnet-4-5, very strong function calling performance.

Claude claude-haiku-3-5: Fastest and cheapest. Excellent for classification, simple extraction, and structured output generation. Poor for complex multi-step reasoning.

GPT-4o mini: Fast and cheap, comparable to claude-haiku-3-5. Best for high-volume simple tasks.

Do not bet a production agent on a single model. Implement model fallback — if Claude is degraded, fall back to GPT-4o. Multi-model resilience is table stakes for production systems.

Conclusion

Building AI agents that work in production requires treating them as distributed systems, not as AI demos. The reliability, observability, cost management, and failure mode handling that production systems demand are engineering challenges as much as they are AI challenges.

The agents that work in production are the ones where the engineering around the LLM — tool design, error handling, context management, cost controls, and observability — is as carefully considered as the prompts themselves.

If you are building production AI agents for healthcare or enterprise use cases, I offer AI agent development services grounded in real production experience.


Muhammad Moid Shams is a Lead Software Engineer specialising in agentic AI systems, FHIR R4 platform architecture, and healthcare AI. He has shipped production AI agents processing data for 20,000+ US healthcare facilities.