Production-Grade AI Agents That Won't Break at 3 AM
Most AI agent tutorials stop right before the part where everything falls apart. This one doesn't.
Every AI agent demo I’ve seen works perfectly. The agent calls a tool, gets a response, formats it nicely, done. Fifteen seconds, clean terminal output, applause.
Then you deploy it. And it calls the same tool four times in a loop because the LLM hallucinated a retry instruction. Or it silently eats an error and returns a confident, completely wrong answer to your user. Or it runs for 47 minutes burning tokens on a task that should take 10 seconds.
I’ve been building agent-based systems for the past few months, and the gap between “works in my notebook” and “runs in production without waking me up” is enormous. This is my attempt to write down what I’ve actually learned about closing that gap. Some of this I’m confident about. Some of it I’m still figuring out.
The problem nobody talks about in agent tutorials
AI agents are stateful, non-deterministic processes that make decisions at runtime. That sentence sounds obvious, but it has consequences that most tutorials skip.
A traditional API endpoint receives a request, does some work, returns a response. The work is predictable. You can write tests for it. You can set timeouts. You know the blast radius.
An agent is different. It decides what to do next based on LLM output, which means you can’t fully predict the execution path. It might call one tool or five. It might finish in 2 seconds or loop for a minute. It might encounter an error from an external API and decide (on its own) to retry, or to try a completely different approach, or to give up and hallucinate an answer.
This is why durability matters so much for agents. Not durability in the “survives a server restart” sense (though that too), but durability in the broader sense: the agent should behave predictably even when the world around it doesn’t.
Step 1: Put boundaries on everything
Before you think about orchestration patterns or fancy frameworks, the single most useful thing you can do is constrain your agent’s behavior.
I mean this literally. Set hard limits on:
Maximum number of LLM calls per task (I usually start with 10 and adjust)
Maximum wall-clock time per agent run
Maximum tokens spent per run
Maximum number of tool invocations
Without these, a confused agent will happily burn through your entire monthly API budget in one run. I’ve seen it happen. Not to me, thankfully. Okay, once to me.
Here’s what a simple bounded agent loop looks like in TypeScript:
async function runAgent(task: string, tools: Tool[], options: AgentOptions) {
  const maxSteps = options.maxSteps ?? 10;
  const maxDurationMs = options.maxDurationMs ?? 30_000;
  const startTime = Date.now();
  const messages: Message[] = [{ role: "user", content: task }];
  let steps = 0;

  while (steps < maxSteps) {
    // Hard wall-clock limit: bail out with an explicit status instead of running forever.
    if (Date.now() - startTime > maxDurationMs) {
      return { status: "timeout", steps, messages };
    }

    const response = await callLLM(messages, tools);
    messages.push(response);
    steps++;

    if (response.toolCalls && response.toolCalls.length > 0) {
      // Each tool call gets its own timeout; failures come back as structured results.
      for (const call of response.toolCalls) {
        const result = await executeToolWithTimeout(call, 5000);
        messages.push({ role: "tool", content: result, toolCallId: call.id });
      }
    } else {
      // No tool calls means the model produced a final answer.
      return { status: "complete", steps, messages };
    }
  }

  return { status: "max_steps_exceeded", steps, messages };
}

Nothing fancy. But notice the return type always includes status. That's the first principle: every agent run should terminate with an explicit status, not just a response. You need to know whether it finished, timed out, or hit a limit. This is the thing that makes the difference between "it worked" and "I can monitor and alert on it."
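The loop above only enforces two of the four limits from the list: steps and wall-clock time. Here's a rough sketch of how the token and tool-invocation budgets could slot into the same loop; the usage fields on the LLM response are an assumption about whatever client library you're using, not part of the code above.

interface Budget {
  maxTokens: number;
  maxToolCalls: number;
}

// A sketch: check the remaining two budgets after each LLM response.
function checkBudget(
  budget: Budget,
  tokensUsed: number,
  toolCallsUsed: number
): "ok" | "token_budget_exceeded" | "tool_budget_exceeded" {
  if (tokensUsed > budget.maxTokens) return "token_budget_exceeded";
  if (toolCallsUsed > budget.maxToolCalls) return "tool_budget_exceeded";
  return "ok";
}

// Inside the loop, after each LLM response (assumes your client reports token usage):
//   tokensUsed += response.usage.promptTokens + response.usage.completionTokens;
//   toolCallsUsed += response.toolCalls?.length ?? 0;
//   const verdict = checkBudget(budget, tokensUsed, toolCallsUsed);
//   if (verdict !== "ok") return { status: verdict, steps, messages };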
Step 2: Make tool execution the reliability boundary
Your agent is only as reliable as its tools. And tools fail. APIs return 500s, databases time out, rate limits kick in.
The pattern I’ve found most useful: wrap every tool in its own error boundary, with its own timeout, and return structured results regardless of success or failure. The LLM is surprisingly good at handling “this tool failed with error X” if you give it that information cleanly. What it’s terrible at is handling a thrown exception that kills the entire agent loop.
async function executeToolWithTimeout(
  call: ToolCall,
  timeoutMs: number
): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);

  try {
    const tool = toolRegistry.get(call.name);
    if (!tool) {
      // The LLM asked for a tool that doesn't exist; report it instead of throwing.
      return JSON.stringify({
        error: true,
        message: `Unknown tool: ${call.name}`,
      });
    }

    const result = await tool.execute(call.arguments, {
      signal: controller.signal,
    });
    return JSON.stringify({ error: false, data: result });
  } catch (err) {
    // Convert any failure into a structured result the LLM can reason about.
    const message =
      err instanceof Error ? err.message : "Tool execution failed";
    return JSON.stringify({ error: true, message });
  } finally {
    clearTimeout(timer);
  }
}

The key insight: never throw from tool execution. Always return a structured result. Let the LLM decide what to do with failures. This is one of those things I'm quite certain about after watching agents in production for a while.
Step 3: Think about durability for long-running agents
Short agents that finish in a few seconds? The pattern above is probably enough. But once agents start running for minutes, or need to survive server restarts, or coordinate with other agents, you need something more. This is where the concept of durable execution comes in. If the process dies after step 3 of 7, you should be able to resume from step 3 instead of starting over.
I think this matters more than most people realize. In serverless environments especially, your function might get killed by the platform after a timeout. Without checkpointing, that’s a complete waste of every token and API call that already happened.
The principle is straightforward even if you don’t use a specific durability framework. After each significant step (LLM call, tool result, decision point), persist the agent’s state somewhere. A database, a queue, a file. Whatever your infrastructure supports. Then build your agent loop to accept a “resume from” parameter.
I’m not going to pretend I’ve nailed this perfectly. My current approach is to store the full message history after each step in Postgres, with a run ID and step number. If the process crashes, a recovery worker picks up incomplete runs and resumes them. It’s not elegant but it works.
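To make that concrete, here's a rough sketch of the checkpoint-and-resume shape. The agent_runs table, its columns, and the pg wiring are illustrative rather than my actual schema; the point is the save-after-every-step, load-on-resume structure.

import { Pool } from "pg";

// Illustrative schema: agent_runs(run_id primary key, step, messages jsonb, status, updated_at)
const pool = new Pool();

async function saveCheckpoint(runId: string, step: number, messages: Message[]) {
  // Persist the full message history after each significant step.
  await pool.query(
    `INSERT INTO agent_runs (run_id, step, messages, status, updated_at)
     VALUES ($1, $2, $3, 'in_progress', now())
     ON CONFLICT (run_id) DO UPDATE
       SET step = $2, messages = $3, updated_at = now()`,
    [runId, step, JSON.stringify(messages)]
  );
}

async function loadCheckpoint(
  runId: string
): Promise<{ step: number; messages: Message[] } | null> {
  const res = await pool.query(
    `SELECT step, messages FROM agent_runs WHERE run_id = $1 AND status = 'in_progress'`,
    [runId]
  );
  if (res.rowCount === 0) return null;
  return { step: res.rows[0].step, messages: res.rows[0].messages };
}

// The agent loop then accepts a "resume from" shape instead of always starting fresh:
async function runDurableAgent(runId: string, task: string, tools: Tool[], options: AgentOptions) {
  const checkpoint = await loadCheckpoint(runId);
  const messages: Message[] = checkpoint?.messages ?? [{ role: "user", content: task }];
  let steps = checkpoint?.step ?? 0;
  // ...the same bounded loop as in step 1, calling saveCheckpoint(runId, steps, messages)
  // after every LLM response and tool result, and marking the row complete at the end.
}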
Step 4: Parallel agents are powerful and dangerous
The Pragmatic Engineer blog recently covered an interesting trend: developers kicking off multiple AI agents in parallel to work on different parts of a codebase simultaneously. The idea is that instead of one agent doing everything sequentially, you split the work and let multiple agents tackle sub-tasks at the same time.
I’ve been experimenting with this and it’s genuinely useful. But it introduces failure modes that sequential agents don’t have.
The obvious one: what happens when agent 3 out of 5 fails? Do you retry just that one? Do you cancel all of them? Does the output of agent 3 depend on agents 1 and 2?
Here’s the pattern I’ve settled on for parallel agent work:
interface AgentTask {
  id: string;
  prompt: string;
  tools: Tool[];
  dependsOn?: string[];
}

async function runParallelAgents(tasks: AgentTask[]): Promise<Map<string, AgentResult>> {
  const results = new Map<string, AgentResult>();
  const pending = new Map(tasks.map((t) => [t.id, t]));

  while (pending.size > 0) {
    // A task is ready when every dependency has completed successfully.
    const ready: AgentTask[] = [];
    for (const [, task] of pending) {
      const depsResolved = (task.dependsOn ?? []).every(
        (dep) => results.has(dep) && results.get(dep)!.status === "complete"
      );
      if (depsResolved) ready.push(task);
    }

    // Nothing is ready but tasks remain: their dependencies failed or form a cycle.
    if (ready.length === 0 && pending.size > 0) {
      for (const [id] of pending) {
        results.set(id, { status: "blocked", steps: 0, messages: [] });
        pending.delete(id);
      }
      break;
    }

    // Run every ready task concurrently; allSettled keeps one failure from killing the batch.
    const batchResults = await Promise.allSettled(
      ready.map(async (task) => {
        const depContext = (task.dependsOn ?? [])
          .map((dep) => results.get(dep))
          .filter(Boolean);
        const contextualPrompt = buildContextualPrompt(task.prompt, depContext);
        const result = await runAgent(contextualPrompt, task.tools, {
          maxSteps: 10,
          maxDurationMs: 30_000,
        });
        return { id: task.id, result };
      })
    );

    // allSettled preserves input order, so index i maps back to ready[i].
    batchResults.forEach((settled, i) => {
      const task = ready[i];
      if (settled.status === "fulfilled") {
        results.set(settled.value.id, settled.value.result);
      } else {
        results.set(task.id, { status: "error", steps: 0, messages: [] });
      }
      pending.delete(task.id);
    });
  }

  return results;
}

Notice the dependency graph. Some agents can run in parallel, but others depend on earlier results. The orchestrator resolves dependencies, runs independent tasks concurrently, and handles failures without killing the entire batch.
I’m going to be honest: the error handling here is something I’m still iterating on. The “blocked” status when dependencies can’t be resolved feels like the right thing, but I haven’t tested it under enough real scenarios to be certain.
Step 5: Observe everything, trust nothing
Remember the observability point from step 1? It comes back here, and it’s even more important with agents than with normal services.
For every agent run, I log the following (there's a rough sketch of the record right after this list):
Total steps taken
Total tokens consumed (prompt and completion separately)
Wall-clock duration
Which tools were called and how many times
The terminal status (complete, timeout, max_steps_exceeded, error)
Whether the agent retried any tool calls
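A minimal sketch of what that per-run record might look like; the field names and the recordAgentRun helper are illustrative, and the sink (stdout here) is whatever your metrics pipeline already ingests.

// Illustrative shape for the per-run record; adapt the fields to your metrics backend.
interface AgentRunMetrics {
  runId: string;
  status: "complete" | "timeout" | "max_steps_exceeded" | "error";
  steps: number;
  promptTokens: number;
  completionTokens: number;
  durationMs: number;
  toolCalls: Record<string, number>; // tool name -> invocation count
  toolRetries: number;
}

function recordAgentRun(metrics: AgentRunMetrics) {
  // One structured log line per run; anything downstream can aggregate on these fields.
  console.log(JSON.stringify({ event: "agent_run_completed", ...metrics }));
}

The transport matters less than emitting one record per run with the same fields every time, so you can establish a baseline per agent.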
This is how you catch the patterns that kill you. “Hey, the invoice-processing agent has been averaging 8 steps for the past week, but today it’s averaging 14.” That’s your early warning. Something changed in the data, or the LLM is behaving differently, or a downstream API is returning errors that cause retries.
Without these metrics, you’ll find out when your token bill arrives. Or when a user complains. Or at 3 AM.
One thing I keep going back to: the bounded execution from step 1 is what makes observability useful. If an agent can run unbounded, your metrics are meaningless because the variance is infinite. Boundaries give you a normal range to compare against.
Step 6: Test the failure modes, not just the happy path
This is the part most people skip and it’s the part that matters most.
Your tests for an agent system should include:
What happens when the LLM returns malformed tool calls?
What happens when a tool times out on every invocation?
What happens when the agent hits its step limit without completing the task?
What happens when two parallel agents try to modify the same resource?
What happens when the LLM decides to call a tool that doesn’t exist?
I write these as integration tests with a mock LLM that returns predefined sequences. It’s not perfect because you can’t predict every weird thing a real LLM will do. But it catches the structural failures: the ones where your orchestration logic breaks, not where the LLM says something dumb.
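To illustrate, here's roughly the shape of one of those tests. The scriptedLLM helper, the runAgentWith variant (the step-1 loop refactored to take the LLM call as a parameter), and the searchTool fixture are all made up for the example; the structure is what matters.

import { test } from "node:test";
import assert from "node:assert/strict";

// Hypothetical scripted stand-in for the real LLM call: returns a fixed sequence of responses.
function scriptedLLM(responses: Message[]) {
  let i = 0;
  return async (_messages: Message[], _tools: Tool[]): Promise<Message> => {
    // If the agent asks for more turns than scripted, keep returning the last response.
    const response = responses[Math.min(i, responses.length - 1)];
    i++;
    return response;
  };
}

test("agent stops at its step limit when the model loops on the same tool call", async () => {
  // Every turn requests the same tool, so the loop can never reach a final answer.
  const loopingResponse: Message = {
    role: "assistant",
    content: "",
    toolCalls: [{ id: "call-1", name: "search", arguments: { query: "same thing" } }],
  };
  const callLLM = scriptedLLM([loopingResponse]);

  // runAgentWith: hypothetical variant of runAgent that takes callLLM as a parameter.
  const result = await runAgentWith(callLLM, "find the thing", [searchTool], { maxSteps: 5 });

  assert.equal(result.status, "max_steps_exceeded");
  assert.equal(result.steps, 5);
});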
For the LLM-says-something-dumb cases, I rely on the boundaries from step 1 and the observability from step 5. You can’t test for every hallucination. But you can make sure hallucinations don’t cause unbounded damage.
What I’d do differently starting from scratch
If I were building a new agent system today, I’d start with the boring stuff first. Timeouts, structured tool results, status tracking, logging. Then I’d add the actual agent logic on top.
Most teams do it the other way around. They get the agent working, it’s exciting, it does cool things. Then they spend three months retrofitting all the production-hardening stuff. I’ve done this. It’s painful. The guardrails are much easier to build when you design around them from the start.
I’d also think carefully about whether I actually need agents at all. A lot of problems that people solve with agents can be solved with a well-structured prompt and a single LLM call. Agents add complexity. Every step in an agent loop is a place where things can go wrong. If your task doesn’t require dynamic tool selection or multi-step reasoning, a simpler approach is almost always better.
That said, when you do need agents (and there are real cases where you do), building them with durability in mind from day one will save you more headaches than any framework or library choice.
Where I’m still figuring things out
I don’t have a great answer for agent memory yet. For short tasks, passing the full message history works fine. For agents that run across multiple sessions or need to remember things from days ago, I’m experimenting with summarization and retrieval patterns, but nothing feels solid yet.
I also don’t have strong opinions on agent frameworks. There are a lot of them. Some seem good, some seem like thin wrappers around API calls with a lot of abstraction for abstraction’s sake. I’ve been writing my own orchestration code because it helps me understand the failure modes, but I could be wrong that this is the best use of my time.
And multi-agent coordination, where agents communicate with each other rather than just running in parallel, is something I’ve read about more than I’ve built. Projects like Wuphf (which uses Git and Markdown files as a shared knowledge base between agents) are interesting because they solve the coordination problem through a shared artifact instead of direct communication. That feels right to me, but I haven’t tested it enough to recommend it.
The honest summary: if you get the basics right (boundaries, structured tool results, observability, explicit status tracking), you can build agent systems that run in production without constant babysitting. The fancy orchestration patterns matter less than you’d think. The boring reliability patterns matter more.
Build the guardrails first. Then let the agents loose inside them.

