Somewhere in a research lab, an agent just failed at a task, wrote a new Python function to handle that exact failure mode, ran a synthetic test against it, and merged the result into its own skill library. No human touched any of it. This is the pattern everyone's racing toward — and almost nobody is thinking about what happens when it goes wrong.

The Loop That Learns

Memento-Skills, an open-source framework from a multi-university research collaboration, implements what its authors call a continual learning cycle. The tagline is "Let Agents Design Agents," but what it actually does is more specific and more interesting than that suggests.

The system runs a four-phase loop:

Read — The agent retrieves candidate skills from a local library and remote catalog. Not everything gets loaded into context. The system selectively fetches what looks relevant to the current task, which matters a lot when your skill library has hundreds of entries and your context window doesn't care about your ambitions.

Execute — Skills run through tool calling in sandboxed environments. File operations, script execution, web interactions — all scoped to prevent the kind of lateral movement that turns a debugging task into a deleted database.

Reflect — When execution fails, the system records state, updates utility scores, and attributes the failure to specific skills. This is the interesting part. The agent doesn't just retry with different parameters. It forms a hypothesis about why a skill failed and whether the skill itself needs changing.

Write — Weak skills get optimized. Broken ones get rewritten. When no existing skill covers the task, the agent authors a new one from scratch and adds it to the library.
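The four phases above compose into a single loop. Here is a minimal sketch of that shape — the `Skill`, `SkillLibrary`, and `learning_round` names, the keyword-overlap retrieval, and the utility update rule are all illustrative assumptions, not the framework's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str
    body: str                      # instructions or script the agent executes
    utility: float = 0.5           # running score updated during Reflect

@dataclass
class SkillLibrary:
    skills: dict = field(default_factory=dict)

    def retrieve(self, task: str, k: int = 3):
        # Read: fetch only skills that look relevant to the task.
        # Naive keyword overlap stands in for real semantic search here.
        scored = [(sum(w in s.body for w in task.split()), s)
                  for s in self.skills.values()]
        return [s for n, s in sorted(scored, key=lambda p: -p[0])[:k] if n > 0]

    def write(self, skill: Skill):
        # Write: new or rewritten skills replace the old entry.
        self.skills[skill.name] = skill

def learning_round(library, task, execute, author_skill):
    """One Read -> Execute -> Reflect -> Write cycle."""
    candidates = library.retrieve(task)                  # Read
    for skill in candidates:
        ok = execute(skill, task)                        # Execute (sandboxed)
        # Reflect: attribute the outcome to this specific skill.
        skill.utility = 0.9 * skill.utility + 0.1 * (1.0 if ok else 0.0)
        if ok:
            return True
    # Write: no existing skill covered the task, so author a new one.
    library.write(author_skill(task))
    return False
```

Note the key property the benchmarks rely on: nothing here touches model weights. All learning lives in the library.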

On GAIA (real-world multi-step reasoning) and Humanity's Last Exam (expert-level questions across disciplines), the framework showed progressive improvement across learning rounds. The skill library grew into semantically meaningful clusters — not a junk drawer of one-off patches, but organized capability modules. All without updating model weights. The frozen LLM stays frozen. Only the skills evolve.

SAGE and the Reinforcement Learning Angle

SAGE (Skill Augmented GRPO for self-Evolution) takes a different route to the same destination. Instead of reflection-based rewriting, it uses reinforcement learning with a mechanism called Sequential Rollout: agents deploy across chains of similar tasks, and skills accumulated from earlier tasks carry forward to later ones.

The compounding effect is real. On AppWorld, SAGE hit 8.9% higher task completion than baseline while burning 59% fewer tokens and taking 26% fewer interaction steps. Doing more with less is the rare kind of benchmark result that actually translates to production savings.
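The Sequential Rollout mechanism is easiest to see as a fold over a task chain. A sketch, assuming hypothetical `solve` and `extract_skills` functions (not SAGE's actual interfaces):

```python
def sequential_rollout(tasks, solve, extract_skills):
    """Run a chain of similar tasks, carrying skills forward.

    `solve(task, skills)` attempts a task with the accumulated skill set
    and returns (success, trajectory); `extract_skills(trajectory)` mines
    reusable skills from the attempt. Both are stand-ins for illustration.
    """
    skills, results = [], []
    for task in tasks:
        success, trajectory = solve(task, skills)
        results.append(success)
        # Skills learned here compound into every later task in the chain.
        skills.extend(extract_skills(trajectory))
    return results, skills
```

The token and step savings come from exactly this accumulation: later tasks start with working skills instead of rediscovering them.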

One in Four Community Skills Has a Vulnerability

Here's where the optimism needs a cold shower.

A comprehensive survey examined 42,447 community-contributed agent skills and found a 26.1% vulnerability rate. Not theoretical risks — actual exploitable patterns. The breakdown: 13.3% exhibited data exfiltration behavior, 11.8% contained privilege escalation paths. Skills that included executable scripts were 2.12x more likely to carry vulnerabilities than instruction-only definitions.

The researchers catalogued specific attack archetypes. "Data Thieves" quietly siphon context window contents to external endpoints. "Agent Hijackers" inject instructions that redirect agent behavior mid-task, turning your helpful coding assistant into someone else's data pipeline. These aren't hypotheticals pulled from a threat model brainstorm. They showed up in real, publicly shared skill repositories.

When a human developer writes a tool, there's code review. CI runs. Maybe a security scan if you're lucky. When an agent writes its own skill at 3 AM during a Sequential Rollout chain, the review process is... the agent itself deciding its work looks fine. That's the equivalent of letting interns approve their own pull requests, except the intern runs at thousands of tokens per second and has no concept of "that endpoint looks suspicious."

Building a Trust Pipeline

The agent skills survey paper proposes a governance model that borrows heavily from container security — and the analogy is apt, because agent skills are basically containers for behavior.

Four verification gates, applied in sequence:

  1. Static analysis catches the obvious stuff — hardcoded credentials, suspicious URL patterns, known malicious signatures

  2. Semantic classification asks whether the skill's actual behavior matches its declared intent

  3. Behavioral sandboxing executes the skill in isolation and monitors for side effects like unexpected network calls or filesystem access outside the declared scope

  4. Permission manifest validation ensures the skill doesn't exceed its stated capabilities
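Chained together, the four gates form a short-circuiting pipeline. A sketch with deliberately simple stand-in checks — every function name and pattern here is illustrative, not from the survey:

```python
import re

def static_analysis(skill):
    # Gate 1: obvious red flags in the raw skill text.
    patterns = [r"AKIA[0-9A-Z]{16}",           # hardcoded AWS key shape
                r"curl\s+\S+\s*\|\s*(ba)?sh"]  # pipe-to-shell download
    return not any(re.search(p, skill["body"]) for p in patterns)

def semantic_match(skill):
    # Gate 2: does behavior match declared intent? Stand-in rule:
    # undeclared network use fails the check.
    if "http" in skill["body"] and "network" not in skill["declared"]:
        return False
    return True

def sandbox_behaves(skill, run_in_sandbox):
    # Gate 3: execute in isolation, compare observed side effects
    # against the declared scope.
    observed = run_in_sandbox(skill)
    return observed <= set(skill["declared"])

def manifest_ok(skill, granted):
    # Gate 4: the permission manifest must not exceed granted capabilities.
    return set(skill["declared"]) <= granted

def verify(skill, run_in_sandbox, granted):
    return (static_analysis(skill)
            and semantic_match(skill)
            and sandbox_behaves(skill, run_in_sandbox)
            and manifest_ok(skill, granted))
```

A skill must clear all four gates; failing any one keeps it out of the library.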

Skills then map to four trust tiers based on provenance. Unvetted skills — including anything an agent writes at runtime — start at the bottom tier: instruction-only access, full tool isolation, no ability to execute anything irreversible. They graduate upward by building a clean record under behavioral monitoring and anomaly detection.
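The graduation logic might look like this. The tier names, the 50-run threshold, and the demote-on-anomaly rule are invented for illustration; the survey defines the tiers by provenance, not these exact mechanics:

```python
from enum import IntEnum

class Tier(IntEnum):
    UNVETTED = 0    # instruction-only, full tool isolation
    MONITORED = 1   # limited tools, every call logged
    TRUSTED = 2     # standard tool access
    CORE = 3        # battle-tested, irreversible actions allowed

def promote(tier, clean_runs, anomalies):
    """Graduate one tier at a time on sustained clean behavior;
    any anomaly drops the skill straight back to UNVETTED."""
    if anomalies > 0:
        return Tier.UNVETTED
    if clean_runs >= 50 and tier < Tier.CORE:
        return Tier(tier + 1)
    return tier
```

The asymmetry is the point: trust accrues slowly and evaporates instantly, which is the right default for code that rewrites itself.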

The difference from container security is that containers don't rewrite themselves after deployment. Agent skills, by design, do. So the trust pipeline isn't a one-time gate. It's a continuous monitoring loop, which is architecturally expensive and operationally annoying and completely non-optional.

A Pragmatic Middle Ground

For teams that want skill evolution without the existential risk, Rajiv Pant published a three-repo architecture that enforces strict dependency boundaries.

A public repo holds open-source, battle-tested skills. A private repo contains org-specific workflows. A shared repo encodes team-level institutional knowledge. The dependency rule is absolute: public depends only on public. Private depends on public and private. Shared depends on public and shared. Cross-contamination is a merge conflict, not a runtime surprise.
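The dependency rule is mechanical enough to enforce in CI. A sketch using the article's repo names (the checker itself is hypothetical):

```python
# Allowed dependency edges per the three-repo rule:
# public -> public only; private -> public, private; shared -> public, shared.
ALLOWED = {
    "public": {"public"},
    "private": {"public", "private"},
    "shared": {"public", "shared"},
}

def check_dependencies(repo, deps):
    """Return the forbidden edges, e.g. a public skill pulling in private."""
    return [d for d in deps if d not in ALLOWED[repo]]
```

A non-empty return fails the build, which is what turns cross-contamination into a merge conflict instead of a runtime surprise.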

Each installed skill carries a .source.json with its commit hash, enabling three-way merges when skills drift between source and installation. It's version control for agent behavior. Boring, effective, and exactly the kind of infrastructure that self-evolving systems need before you point them at anything that matters.
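Drift detection from the `.source.json` record reduces to a hash comparison. A sketch — the field names are guesses, and a real implementation would hash against the upstream commit rather than a stored digest:

```python
import hashlib
import json

def record_source(skill_text, commit_hash):
    # Written alongside the skill at install time.
    return json.dumps({"commit": commit_hash,
                       "content_sha": hashlib.sha256(skill_text.encode()).hexdigest()})

def drifted(installed_text, source_json):
    """True when the installed skill no longer matches what was installed,
    meaning a three-way merge against upstream is needed."""
    meta = json.loads(source_json)
    return hashlib.sha256(installed_text.encode()).hexdigest() != meta["content_sha"]
```

With the original commit hash as the merge base, standard three-way merge tooling handles the rest.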

When This Pattern Will Burn You

Self-evolving skill libraries work well for research environments, exploratory automation, and tasks where variety is high and stakes are low. They're a liability in regulated workflows that need audit trails, high-stakes operations where a quietly mutated skill can cause real damage, and shared deployments where one user's skill evolution contaminates everyone else's runtime.

The safest deployment model treats skill mutations like code deployments: evolution happens in staging, synthetic tests validate the result, and promotion to production requires explicit approval. The agents are getting genuinely good at writing their own tools. Whether your governance can keep pace is a different question entirely.
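That deployment discipline reduces to a simple gate. A closing sketch (all names invented), where the crucial detail is that `approve` is a human, not the agent grading its own work:

```python
def promote_mutation(mutation, run_synthetic_tests, approve):
    """Skill mutations follow the code-deployment path:
    evolve in staging, validate synthetically, then require
    explicit approval before production."""
    if mutation["env"] != "staging":
        return False, "mutations may only originate in staging"
    if not run_synthetic_tests(mutation):
        return False, "synthetic tests failed"
    if not approve(mutation):                 # a human, not the agent
        return False, "awaiting explicit approval"
    return True, "promoted to production"
```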