Somebody Finally Built Runtime Security for Agents

Most agent systems in production today have exactly zero runtime security. No policy enforcement on tool calls. No identity verification between agents. No defense against an adversarial document quietly rewriting an agent's goals mid-session. In December 2025, OWASP published the first formal threat taxonomy for agentic AI — ten distinct attack categories, each observed in real deployments. Four months later, most teams building agents still haven't read it.

The Threat Model Nobody Asked For

OWASP's Agentic Top 10 isn't theoretical. It names ten attack categories that security researchers found in production systems, not lab demos:

Goal hijacking — adversarial inputs redirect agent objectives
Tool misuse — agents call tools in unintended or dangerous ways
Identity abuse — agents impersonate users or escalate privileges
Supply chain compromise — malicious plugins or dependencies
Unsafe code execution — agents run arbitrary code without sandboxing
Memory poisoning — corrupted long-term memory affects future sessions
Insecure inter-agent communication — unencrypted or unauthenticated agent-to-agent messages
Cascading failures — one agent's error propagates through the entire system
Human-agent trust exploitation — agents manipulate users into granting excessive permissions
Rogue agents — agents that deviate from their intended behavior with no mechanism to stop them

Notice how many of these aren't prompt injection. Prompt injection lives inside goal hijacking, but the taxonomy exposes a much wider surface. Your agent's memory is a vector. Its tools are vectors. The other agents it delegates to are vectors. The marketplace plugins you installed last Tuesday are vectors.

Three Attacks Worth Understanding

Memory Poisoning

Unlike prompt injection, which hits once, memory poisoning is persistent. An attacker plants malicious content in an agent's long-term store — a vector database, a conversation history, a knowledge vault. The agent retrieves the poisoned entry in future sessions and acts on it.

Lakera's research describes a scenario they call ClauseAI: an attacker embeds hidden instructions inside a court filing. A legal AI assistant later retrieves that filing from its knowledge base during an unrelated case. The poisoned text convinces the assistant to exfiltrate the name of a protected witness via email. The attack didn't happen during the filing's upload — it happened days later, triggered by a different user's innocent query.

That's the core problem. Memory poisoning rewrites the past, and every future session touching the corrupted entry is compromised until someone finds and removes it. Most agent architectures treat their own memory as trusted input. They shouldn't.

Goal Hijacking Over Long Horizons

Where memory poisoning corrupts data, goal hijacking corrupts intent. The agent keeps functioning, keeps appearing aligned — but its actions gradually serve attacker objectives.

Picture a financial advisory agent that receives a due-diligence PDF. Buried in the document is a carefully crafted instruction that shifts the agent's risk assessment criteria. Over subsequent interactions, it begins characterizing fraudulent companies as safe investments. The shift is gradual enough that no single response triggers a red flag.

Defending against this requires continuous objective verification — not just checking inputs, but monitoring whether outputs stay aligned with the original mandate across entire workflows. A one-time system prompt won't cut it.

Rogue Agents in Multi-Agent Meshes

Single-agent risks are well-understood. Multi-agent systems introduce a risk category that's qualitatively different. When Agent A delegates to Agent B, how do you verify Agent B's identity? What if Agent B's tool-calling permissions exceed what Agent A intended to grant?

In the hub-and-spoke pattern dominating production deployments, a compromised worker agent can exfiltrate data through the orchestrator, escalate privileges by impersonating a trusted peer, or trigger cascading failures across the entire mesh. The blast radius isn't one process — it's every agent in the network.

Microsoft's Agent Governance Toolkit

On April 2, Microsoft open-sourced the Agent Governance Toolkit — the first project that attempts to address all ten OWASP risks at the runtime layer. It ships as seven independently installable packages:

Package	Purpose
Agent OS	Stateless policy engine. Intercepts actions pre-execution in <0.1ms p99. Speaks YAML, OPA Rego, and Cedar.
Agent Mesh	Cryptographic identity via DIDs + Ed25519. Dynamic trust scoring (0–1000 scale, five behavioral tiers).
Agent Runtime	Execution rings modeled on CPU privilege levels. Saga orchestration for multi-step transactions. Emergency kill switch.
Agent SRE	SLOs, error budgets, circuit breakers, chaos testing.
Agent Compliance	Automated verification against OWASP Agentic Top 10 and regulatory frameworks.
Agent Marketplace	Plugin lifecycle management with cryptographic signing and capability gating.
Agent Lightning	Policy enforcement during reinforcement learning training.

The toolkit hooks into existing frameworks through native extension points — LangChain callbacks, CrewAI decorators, Agent Framework middleware — so you don't have to rewrite your stack.

The execution rings model deserves a closer look. Borrowed from OS kernel architecture, it assigns dynamic privilege levels based on trust score and behavioral history. A newly deployed agent starts in a restricted ring with limited tool access. Verified behavior earns promotion to higher privilege tiers. A single policy violation drops it back — or activates the kill switch. It's the principle of least privilege, applied to autonomous software.

The Caveats

Runtime policy enforcement is necessary. It's not sufficient.

The toolkit can intercept a tool call and check it against a policy. Whether it can determine that the semantic intent behind that call has been hijacked is a harder question. The "semantic intent classifier" for goal hijacking is listed as a feature, but classifying intent from natural language remains probabilistic. The classifier catches obvious deviations and misses the subtle ones — which are the ones that matter in adversarial scenarios.

Memory poisoning defense via cross-model verification with majority voting adds latency and cost. Running every memory retrieval through multiple models for consensus defeats the purpose of having fast agents. Teams will disable it.

The fundamental tension: you're using language models to protect language models. The policy engine's sub-millisecond latency handles the deterministic checks well — "is this tool call allowed by policy?" But the hard security decisions require language understanding, and language understanding is never deterministic. The architecture diagram shows clean boxes and arrows. Adversarial reality is messier.

The toolkit ships with 9,500+ tests, SLSA-compatible provenance, and CodeQL scanning. That's the right engineering posture. Whether the security guarantees hold under real adversarial pressure is a question only red teams and production incidents will answer — and the answer will almost certainly be "partially."

#The Threat Model Nobody Asked For

#Three Attacks Worth Understanding

#Memory Poisoning

#Goal Hijacking Over Long Horizons

#Rogue Agents in Multi-Agent Meshes

#Microsoft's Agent Governance Toolkit

#The Caveats