Most agent loops work like this: the model picks a tool, calls it, gets the result, picks the next tool. Rinse, repeat. Somewhere between steps three and thirty, the agent starts contradicting itself — revisiting files it already checked, forgetting hypotheses it formed two turns ago, making decisions that conflict with evidence it gathered five minutes earlier.

A developer on the MiniMax team hit this exact wall. Their 230-billion-parameter model handled complex reasoning in a single turn without breaking a sweat. Wire it into a multi-step agent loop with tools? Performance cratered. The model wasn't broken. The agent framework was silently discarding the model's reasoning blocks between turns — effectively giving it amnesia after every action.

Three Ways Agents Reason (or Don't)

Agent architectures fall into three reasoning modes, and most production systems are stuck on the worst one.

Mode 1: Plan everything, then execute. The model generates a full plan upfront, then executes each step blindly. This works until the third tool call returns something unexpected — and the remaining plan is now based on assumptions that no longer hold. Brittle by design.

Mode 2: Act, act, act. The model sees a tool result, immediately picks the next action, moves on. No pause. No reflection. This is the default in most frameworks — LangChain, CrewAI, AutoGen all operate this way unless you deliberately configure otherwise. It's fast, but it drifts. By tool call fifteen, the agent has no coherent thread connecting its decisions.

Mode 3: Think, act, observe, think. The model explicitly reasons after every tool call. It updates its working hypotheses, reconsiders the plan, checks whether the last result changed anything fundamental. Then it picks the next action. This is interleaved thinking.

Mode 3 costs more. Significantly more. But it's the only mode that holds up past ten tool calls on non-trivial tasks.
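The three modes differ only in what happens between tool calls. A Mode 3 loop can be sketched in a few lines; `call_model` and `run_tool` below are hypothetical stand-ins for your model client and tool executor, with stub implementations so the shape of the loop is visible:

```python
def call_model(messages):
    # Stub: a real implementation calls an LLM and returns reasoning
    # plus either a tool call or a final answer. Here we finish after
    # one tool result so the loop terminates.
    if any(m["role"] == "tool" for m in messages):
        return {"reasoning": "result looks sufficient", "final": "done"}
    return {"reasoning": "need to inspect the file first",
            "tool": {"name": "read_file", "args": {"path": "app.py"}}}

def run_tool(call):
    return f"<contents of {call['args']['path']}>"  # stub tool result

def agent_loop(task, max_steps=10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_model(messages)
        # The crucial part: the reasoning goes back into the history,
        # so the next turn sees what the model was thinking — this is
        # exactly what most frameworks silently drop.
        messages.append({"role": "assistant",
                         "reasoning": step["reasoning"],
                         "content": step.get("final", ""),
                         "tool_call": step.get("tool")})
        if "final" in step:
            return step["final"], messages
        messages.append({"role": "tool", "content": run_tool(step["tool"])})
    return None, messages
```

The loop is trivial; the entire difference between Mode 2 and Mode 3 is whether the `reasoning` field survives the round trip into `messages`.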

The Performance Gap Is Larger Than You'd Expect

MiniMax ran controlled experiments comparing their M2 model with reasoning state preserved versus discarded between turns:

| Benchmark | State Preserved | State Discarded | Improvement |
|---|---|---|---|
| BrowseComp (web research) | 44.0% | 31.4% | +40% |
| Tau² (long-horizon tasks) | 87 | 64 | +36% |
| GAIA (general assistant) | 75.7% | 67.9% | +11% |
| SWE-Bench Verified | 69.4% | 67.2% | +3% |
BrowseComp's 40% jump is the headline number. Web research requires continuous adaptation — each page you visit changes what you should search for next. Without the reasoning loop, the model treats each search as its first, losing the thread that connects them.

SWE-Bench shows a smaller gap because coding tasks tend to be more structured — the file system provides its own form of persistent state. But even there, preserved reasoning helps.

Your Framework Is Probably Stripping It

Here's where it gets frustrating. Most agent frameworks don't preserve reasoning state by default — and the failure is completely silent.

The OpenAI Chat Completions API has no mechanism for passing reasoning content back in subsequent requests. If your model generates <think> blocks during a tool call, those blocks vanish from the conversation history on the next turn. The model starts fresh, armed with only the tool result and the original system prompt.

Anthropic's API handles this differently. Thinking blocks get appended to message history and persist across turns natively. MiniMax added a reasoning_details field to their OpenAI-compatible endpoint as a workaround — you grab the field from the response and stuff it back into the next request.
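The round trip is mechanical: copy the field off each assistant response and carry it forward in the next request's history. The `reasoning_details` field comes from MiniMax's endpoint as described above; the helper name below is a hypothetical sketch, not library API:

```python
def to_history_message(response_message):
    """Build the assistant message to append to history, keeping reasoning.

    `response_message` is the dict from choices[0]["message"] in an
    OpenAI-compatible response that includes `reasoning_details`.
    """
    msg = {"role": "assistant",
           "content": response_message.get("content") or ""}
    if "tool_calls" in response_message:
        msg["tool_calls"] = response_message["tool_calls"]
    # Without this line, the model gets amnesia on the next turn.
    if "reasoning_details" in response_message:
        msg["reasoning_details"] = response_message["reasoning_details"]
    return msg
```

Usage: after each response, `messages.append(to_history_message(resp["choices"][0]["message"]))` before appending the tool result and making the next call.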

But if you're running agents through a framework abstraction layer — and most people are — the question is whether that layer preserves reasoning artifacts or tosses them. Most toss them. Not out of malice. The default data structures just don't have a slot for "stuff the model thought but didn't say out loud."

The symptom is maddeningly subtle: your agent still calls tools, still produces output, still looks like it's working. But it's making decisions without context. A developer debugging this might blame the model, tune the prompts, add more tools — when the actual fix is preserving three extra fields in the message array.

When the Token Tax Is Worth Paying

Interleaved reasoning burns 5–10x more tokens than standard mode. A task that costs $0.10 normally might run $0.50–$1.00 with thinking enabled. On hard problems with the token budget cranked up, you can blow through most of a 200K context window on reasoning alone.

Worth it for: autonomous research spanning dozens of searches and synthesizing findings. Complex debugging where the agent tests hypotheses across multiple files. Any workflow where early assumptions regularly prove wrong and mid-course correction is the difference between success and a wasted run.

Not worth it for: CRUD operations. Single-turn lookups. Anything under five tool calls. If the task is predictable enough that a plan-then-execute approach works, skip the overhead entirely.
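The two lists above reduce to a simple gate. The thresholds below are illustrative assumptions distilled from this tradeoff, not benchmarked values:

```python
def use_interleaved_thinking(expected_tool_calls, task_is_predictable):
    """Decide whether the 5-10x token tax is likely worth paying.

    Heuristic only: predictable tasks suit plan-then-execute, and
    anything under ~5 tool calls rarely benefits from reflection.
    """
    if task_is_predictable:
        return False  # plan-then-execute is cheaper and good enough
    if expected_tool_calls < 5:
        return False  # too short for drift to matter
    return True       # long, unpredictable runs are where Mode 3 pays off
```

The exact cutoffs belong in config, not code; the point is to make the decision explicit per task type instead of globally enabling or disabling thinking.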

The Failure Mode Nobody Warns You About

Mode 3 introduces its own pathology: error compounding. A wrong hypothesis in step five gets carried forward as live context. The model's subsequent reasoning builds on that mistake, and each step adds more apparent evidence for the wrong conclusion. By step twenty, the model is confidently wrong — and has a twenty-step reasoning chain to prove it.

This is harder to catch than simple drift because the agent looks coherent. It's calling relevant tools, referencing prior findings, building a logical narrative. The narrative just happens to rest on a rotten foundation.

Practical defense: monitor the ratio of new information to repeated actions. If your agent is on its fifteenth tool call and still circling the same three files with increasing confidence, the reasoning chain has gone off the rails. Kill the run. Adjust the starting context. Restart fresh.
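One way to operationalize that defense is to count how often the agent repeats a tool call it has already made. The class and thresholds below are a sketch under assumed values; tune the ratio per workload:

```python
from collections import Counter

class LoopDetector:
    """Flags a run when too many tool calls are repeats of earlier ones."""

    def __init__(self, max_repeat_ratio=0.5, min_calls=6):
        self.calls = Counter()
        self.max_repeat_ratio = max_repeat_ratio
        self.min_calls = min_calls  # don't judge before this many calls

    def record(self, tool_name, args_key):
        """Log a tool call; returns True if the run looks stuck."""
        self.calls[(tool_name, args_key)] += 1
        total = sum(self.calls.values())
        if total < self.min_calls:
            return False
        # Calls beyond each pair's first occurrence are "repeats".
        repeats = total - len(self.calls)
        return repeats / total > self.max_repeat_ratio
```

In the agent loop, call `detector.record(tool, key)` after each tool call and kill the run when it returns True — then restart with adjusted starting context rather than letting the chain keep reinforcing itself.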

Who Actually Ships This

Claude 4 family supports extended thinking with tool use — reasoning persists natively across turns. MiniMax M2 exposes it via a reasoning_details field. GLM-4.7 runs chain-of-thought before every tool call. Arcee's Trinity Large Thinking was built specifically for long-horizon agent tasks. Kimi K2 Thinking runs INT4 quantized with reportedly 2x faster generation.

OpenAI's o-series reasons internally but keeps it server-side. You can't replay it, inspect it, or ensure it persists across your agent's turns. For production loops where observability matters, that opacity is a real limitation.

Go check what your framework does with reasoning blocks between turns. The fix might be a single configuration flag. Or it might mean swapping your message serialization layer. Either way, it's cheaper than spending another week wondering why your agent keeps forgetting what it was doing.