An agent is researching competitors, drafting a synthesis, and scheduling a meeting. Fourteen steps in, the container gets rescheduled. Four minutes of LLM calls, tool results, and intermediate reasoning: evaporated. The user sees a timeout. The framework logs "workflow failed." The demo had worked perfectly for three months.

Durable execution — the pattern where code automatically checkpoints its state and resumes after failure — has existed for years in workflow engines like Temporal. Always useful, always kind of boring. Then agents came along and turned it into a survival requirement.

The Compound Failure Math

Here's the arithmetic that makes retry logic insufficient. If each step in a sequential agent workflow succeeds 99% of the time, a 17-step chain completes 84% of the time. At 95% per step — realistic when you factor in rate limits, tool timeouts, and context overflows — completion drops to 42%.
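The arithmetic above is easy to verify. Here is a minimal sketch, assuming every step fails independently with the same probability; the function name chainSuccessRate is only illustrative:

```typescript
// Probability that a sequential chain finishes, assuming each step
// succeeds independently with the same probability.
function chainSuccessRate(perStepSuccess: number, steps: number): number {
  return Math.pow(perStepSuccess, steps);
}

const asPercent = (p: number): string => `${Math.round(p * 100)}%`;

console.log(asPercent(chainSuccessRate(0.99, 17))); // 84%
console.log(asPercent(chainSuccessRate(0.95, 17))); // 42%
```

Real failures are rarely independent (a rate-limit storm hits several steps at once), so treat these numbers as a rough guide rather than a forecast.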

Traditional retries handle transient errors at individual steps. They don't help when the accumulated state is the asset. Steps 1 through 13 gathered context, made decisions, built partial outputs. A crash at step 14 doesn't just lose one API call. It loses everything the agent learned along the way. And because LLM calls are non-deterministic, replaying from scratch produces different results — which may invalidate downstream logic built on the original reasoning.

The cost isn't just compute dollars. It's correctness.

Three Approaches to Keeping Agents Alive

The infrastructure ecosystem has converged on three patterns. Each makes different tradeoffs about complexity, portability, and how much it changes your code.

| | Temporal | LangGraph | Cloudflare Fibers |
|---|---|---|---|
| Mechanism | Event-sourced replay | Full-state snapshots per node | Explicit stash() to SQLite |
| Recovery | Replays history, skips completed activities | Reloads last checkpoint, re-enters graph | Calls onFiberRecovered with snapshot |
| Determinism required? | Yes (no Date.now() or Math.random()) | No | No |
| Storage model | Append-only event log | Full state at each node | Developer-controlled snapshots |
| Best for | Complex, long-running multi-agent orchestration | Graph-structured agent workflows | Edge-deployed agents on Cloudflare |

Temporal records every activity completion in an append-only history. A crashed worker gets replaced; the new one replays the history, reading cached results instead of re-executing. Seamless, but your workflow code must be fully deterministic. No timestamps, no random values, no environment variable reads inside the workflow function. Temporal provides deterministic alternatives for all of these, but teams hit this constraint hard when porting existing agent scripts. For very long workflows, the event history grows with every step, and Temporal limits how large a single execution's history may get, so continueAsNew carries the current state into a fresh execution with an empty history. It is essential plumbing that you have to plan for upfront.
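The replay mechanism described above can be illustrated with a toy model. This is a sketch in plain TypeScript, not the Temporal SDK: ReplayContext, researchWorkflow, and eventLog are hypothetical names, and the activities are synchronous stand-ins for real LLM and tool calls.

```typescript
// Toy model of event-sourced replay. This is NOT the Temporal SDK;
// ReplayContext and researchWorkflow are hypothetical names.
type HistoryEvent = { name: string; result: unknown };

class ReplayContext {
  private cursor = 0;
  readonly executed: string[] = []; // activities that really ran in this attempt

  constructor(private readonly log: HistoryEvent[]) {}

  activity<T>(name: string, run: () => T): T {
    const recorded = this.log[this.cursor++];
    if (recorded !== undefined) {
      // Replay path: the code must request the same activity in the same
      // order, which is why the workflow function has to be deterministic.
      if (recorded.name !== name) {
        throw new Error(`non-deterministic replay: expected ${recorded.name}, got ${name}`);
      }
      return recorded.result as T;
    }
    const result = run();
    this.log.push({ name, result }); // append-only log
    this.executed.push(name);
    return result;
  }
}

function researchWorkflow(ctx: ReplayContext, crashAfterFirstStep = false): string {
  const competitors = ctx.activity("findCompetitors", () => ["Acme", "Globex"]);
  if (crashAfterFirstStep) throw new Error("worker crashed");
  return ctx.activity("summarize", () => `Compared ${competitors.length} competitors`);
}

const eventLog: HistoryEvent[] = []; // stands in for the durable event history

try {
  researchWorkflow(new ReplayContext(eventLog), true);
} catch {
  // The first worker is gone, but the log survives.
}

const replacement = new ReplayContext(eventLog);
const summary = researchWorkflow(replacement);
console.log(summary);              // Compared 2 competitors
console.log(replacement.executed); // only "summarize" ran on the second attempt
```

The name check inside activity() is where a non-deterministic workflow would be caught: if the code took a different branch on replay, the recorded history would no longer line up with what the code asks for.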

LangGraph takes a simpler approach. Every node in the execution graph is a checkpoint boundary — full state serialized to PostgreSQL (or MemorySaver for local development). Recovery loads the last checkpoint and re-enters the graph at that node. No determinism requirement since you're restoring state, not replaying code. The tradeoff is storage volume: every checkpoint contains the complete state rather than a delta.
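To make the snapshot approach concrete, here is a toy model of per-node checkpointing. It is a sketch in plain TypeScript and does not use the LangGraph API: runGraph, GraphNode, and the in-memory Map (standing in for a PostgreSQL table) are all assumptions made for illustration.

```typescript
// Toy model of per-node snapshot checkpointing. This is NOT the LangGraph
// API; the Map stands in for a PostgreSQL-backed checkpoint table.
type AgentState = { notes: string[] };
type GraphNode = { name: string; run: (state: AgentState) => AgentState };
type Checkpoint = { nextIndex: number; state: AgentState };

const checkpoints = new Map<string, string>(); // threadId -> serialized checkpoint

function runGraph(threadId: string, nodes: GraphNode[], crashAt?: string): AgentState {
  const saved = checkpoints.get(threadId);
  const start: Checkpoint = saved !== undefined
    ? (JSON.parse(saved) as Checkpoint)
    : { nextIndex: 0, state: { notes: [] } };

  let state = start.state;
  nodes.slice(start.nextIndex).forEach((node, offset) => {
    if (node.name === crashAt) throw new Error(`crashed in ${node.name}`);
    state = node.run(state);
    // The full state is written after every node, not a delta.
    checkpoints.set(
      threadId,
      JSON.stringify({ nextIndex: start.nextIndex + offset + 1, state })
    );
  });
  return state;
}

let llmCalls = 0; // counts how many nodes actually executed
const step = (name: string): GraphNode => ({
  name,
  run: (s) => { llmCalls++; return { notes: [...s.notes, name] }; },
});
const graph = [step("research"), step("draft"), step("schedule")];

try {
  runGraph("thread-1", graph, "schedule"); // crash before the third node runs
} catch {
  // The process dies; the checkpoint written after "draft" is already stored.
}
const finalState = runGraph("thread-1", graph); // resumes at "schedule"
console.log(finalState.notes, llmCalls); // all three notes present; llmCalls is 3
```

Restarting from scratch would have cost five node executions instead of three. The price is visible in the checkpoints.set call: each write serializes the entire state, which is the storage tradeoff noted above.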

Cloudflare's Project Think, announced during Agents Week on April 13, introduces fibers — durable function invocations backed by SQLite on Durable Objects:

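// Excerpt from inside an agent class; topics and this.research() come from the surrounding code.
this.runFiber("research", async (ctx) => {
  const findings = []; // assumption: a local accumulator for the results
  for (let i = 0; i < topics.length; i++) {
    const result = await this.research(topics[i]);
    findings.push(result);
    // Persist progress after each completed topic so recovery can skip it.
    ctx.stash({ findings, step: i });
  }
});

// Called on recovery with the last stashed snapshot.
async onFiberRecovered(ctx) {
  const { findings, step } = ctx.snapshot;
  // step is the last topic that finished, so resume at the next one.
  await this.resumeResearch(findings, step + 1);
}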

Fibers use a two-tier survival model. Short tasks call keepAlive() to prevent eviction. Long tasks — CI pipelines, video generation — hibernate: persist a job ID, sleep consuming zero compute, wake on callback. You only pay while the agent is actually thinking.

The Side Effect Trap

Checkpointing solves "resume." It doesn't solve "don't do it twice." If step 10 sends an email and the process crashes before that step's completion is recorded, recovery will re-run the step and happily send the email again unless you prevent it.

Every framework punts this to the developer. The pattern that works: bind each external write to a deterministic idempotency key — {workflow_id}:{step_name} — and have the receiving service deduplicate. Simple, unglamorous, and entirely your responsibility. No framework automates idempotent side effects for you, despite what the marketing pages imply.
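The idempotency-key pattern described above can be sketched in a few lines. EmailService here is a hypothetical receiver written for illustration, not a real provider API, and its in-memory Set is an assumption standing in for whatever durable deduplication store the receiving service uses.

```typescript
// Sketch of the {workflow_id}:{step_name} idempotency pattern.
// EmailService is a hypothetical receiver, not a real provider API.
function idempotencyKey(workflowId: string, stepName: string): string {
  return `${workflowId}:${stepName}`;
}

class EmailService {
  private readonly seen = new Set<string>(); // a real service would persist this
  readonly delivered: string[] = [];

  // Returns true if the email was sent, false if the key was already used.
  send(key: string, to: string): boolean {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    this.delivered.push(to);
    return true;
  }
}

const emailService = new EmailService();
const emailKey = idempotencyKey("wf-123", "send-summary-email");

const firstAttempt = emailService.send(emailKey, "team@example.com");  // original run
const afterRecovery = emailService.send(emailKey, "team@example.com"); // re-run after a crash

console.log(firstAttempt, afterRecovery, emailService.delivered.length); // true false 1
```

The key has to be derived from stable identifiers only. Anything that changes between the original run and the recovery run (a timestamp, a random suffix, LLM output) would produce a new key and defeat the deduplication.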

Skip This If

Quick agents — under 30 seconds, single tool call — don't need checkpointing. The overhead of serialization, storage writes, and recovery handlers isn't worth it when a plain retry costs pennies. The heuristic: if a failure wastes more than a dollar or more than a minute of accumulated work, add durability. Otherwise, exponential backoff handles it.
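The heuristic and the fallback above can be written down directly. This is a minimal sketch: the dollar and minute thresholds come from the text, while the function names, the 200 ms base delay, and the 10-second cap are assumptions chosen for illustration.

```typescript
// Thresholds follow the heuristic in the text; all names are illustrative.
type FailureCost = { wastedUsd: number; wastedSeconds: number };

function needsDurability(cost: FailureCost): boolean {
  return cost.wastedUsd > 1 || cost.wastedSeconds > 60;
}

// Delay before retry number `attempt` (0-based), doubling each time up to a cap.
function backoffDelayMs(attempt: number, baseMs = 200, capMs = 10000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Plain retry loop for the cheap case where durability is not worth it.
async function retryWithBackoff<T>(fn: () => Promise<T>, maxAttempts = 4): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        await new Promise<void>((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
      }
    }
  }
  throw lastError;
}

console.log(needsDurability({ wastedUsd: 0.02, wastedSeconds: 8 }));  // false
console.log(needsDurability({ wastedUsd: 0.4, wastedSeconds: 240 })); // true
console.log([0, 1, 2, 3].map((n) => backoffDelayMs(n)));              // 200, 400, 800, 1600
```

Production retry loops usually add random jitter to each delay so that many clients do not retry in lockstep; it is left out here to keep the schedule easy to read.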

Where the Stack Is Heading

Agent frameworks are absorbing durable execution rather than delegating it to external engines. LangGraph ships checkpointing natively. Cloudflare baked fibers into the agent SDK. Microsoft's Agent Framework has an active discussion thread about adding Temporal integration. Inngest, Dapr, and a handful of startups like SnapState are building agent-specific durability layers from scratch.

The reason is fit. Agent state is larger than microservice state — full conversation histories, tool results, reasoning chains. Execution is non-deterministic. Re-execution costs real money. General-purpose workflow engines work, but purpose-built primitives match the shape of the problem better.

Meanwhile, IDC reports that 88% of agent proof-of-concepts never reach production. A non-trivial percentage of those die of exactly this failure mode. The demo worked because nothing crashed during the three-minute run. Production doesn't extend the same courtesy.