The Defense Evolver: Can AI Learn to Defend Itself?

Evolutionary approaches to AI red-teaming are becoming well-established. You start with a population of adversarial prompts, mutate them, test them against a model, keep the ones that work, and repeat. Over generations, the attacks get better. Our own attack evolver discovered novel jailbreak techniques through exactly this process.

But here is the question nobody seems to be asking: can you do the same thing for defense?

Take a population of system prompts. Mutate them. Test them against an attack corpus. Keep the ones that block more attacks. Breed the survivors together. Over generations, do the defenses get better?

The structural parallel is exact. The genome changes from “adversarial prompt” to “system prompt.” The fitness function inverts from “did the model comply?” to “did the model refuse?” Everything else — mutation, selection, recombination — works the same way.
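
Concretely, the loop is the standard genetic algorithm with the genome swapped out. A minimal sketch of that loop, where `refusal_rate`, `mutate`, and `crossover` are hypothetical stand-ins for a real evaluation harness and real prompt-rewriting operators:

```python
import random

def evolve_defenses(seed_prompts, attack_corpus, refusal_rate, mutate, crossover,
                    generations=50, population_size=20, survivors=5):
    """Single-population defense evolution (sketch).

    refusal_rate(prompt, attacks): fraction of the attack corpus the model
    refuses under this system prompt; stands in for a real evaluation harness.
    mutate(prompt) and crossover(a, b): return new candidate system prompts.
    """
    population = list(seed_prompts)
    for _ in range(generations):
        # Selection: rank prompts by how much of the attack corpus they block.
        ranked = sorted(population, key=lambda p: refusal_rate(p, attack_corpus),
                        reverse=True)
        parents = ranked[:survivors]
        # Variation: breed and mutate the survivors to refill the population.
        population = list(parents)
        while len(population) < population_size:
            a, b = random.sample(parents, 2)
            population.append(mutate(crossover(a, b)))
    return max(population, key=lambda p: refusal_rate(p, attack_corpus))
```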

So why has nobody done it?

The Three Asymmetries

Defense evolution is fundamentally harder than attack evolution. Three asymmetries explain why, and they apply far beyond LLMs.

The first asymmetry is fitness structure. An attack succeeds if it finds any single vulnerability. An attack evolver’s fitness function is disjunctive — OR across failure modes. Find one crack and you win.

A defense succeeds only if it blocks all vulnerabilities. A defense evolver’s fitness function is conjunctive — AND across all attack classes. Miss one crack and you lose.

In a corpus of k attack classes, the attack evolver needs just one of the k to work. The defense evolver needs to block all k, and its probability of doing so shrinks exponentially as k grows. That is what makes the defense search space so much harder to navigate.
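
A toy illustration of the two fitness structures and of that exponential decay, assuming (purely for the arithmetic) that each attack class is blocked independently with probability p:

```python
def attack_fitness(bypassed):
    # Disjunctive: the attacker wins if ANY attack class gets through.
    return any(bypassed)

def defense_fitness(bypassed):
    # Conjunctive: the defender wins only if EVERY attack class is blocked.
    return not any(bypassed)

# If each of k classes is blocked independently with probability p,
# the chance of blocking all of them decays exponentially in k.
p, k = 0.95, 40
print(f"P(block all {k} classes) = {p ** k:.3f}")          # ~0.129
print(f"P(at least one gets through) = {1 - p ** k:.3f}")  # ~0.871
```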

The second asymmetry is the waterbed effect. Strengthening a system prompt against one attack class often weakens it against another. We observe this empirically in our data: prompts that aggressively refuse authority-claim attacks become vulnerable to format-lock attacks that avoid authority framing entirely. Prompts that refuse structured output requests become less helpful for legitimate structured tasks.

This is the prompt-level analog of the accuracy-robustness tradeoff documented in adversarial machine learning. Defense mutations that improve fitness on one dimension may degrade it on another, creating a rugged fitness landscape with many local optima and few global ones.

The third asymmetry is the novelty gap. Attack evolvers can succeed by discovering techniques absent from the defender’s training distribution. Defense evolvers can only succeed against attacks they have seen. Attackers operate in the space of possible future attacks. Defenders operate in the space of known past attacks.

Co-Evolution as the Only Viable Strategy

Static defense evolution — optimizing against a fixed attack corpus — converges to brittle, overfitted prompts. A defense evolved against last month’s attacks develops narrow keyword-level countermeasures that fail against anything novel. This is the prompt-level analog of adversarial overfitting in machine learning.

The solution is co-evolution. Evolve attack and defense populations simultaneously, each serving as the other’s selection pressure. When a defense genome blocks an attack family, the attack population evolves to find new bypass routes. When a new attack variant emerges, the defense population evolves countermeasures. Neither population can rest.
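
The arms race can be sketched as two interleaved selection steps; `attack_success`, `mutate_attack`, and `mutate_defense` are hypothetical stand-ins for the real evaluation and mutation machinery:

```python
def coevolve(defenses, attacks, attack_success, mutate_attack, mutate_defense,
             generations=100, keep=5):
    """Two-population arms race (sketch).

    attack_success(attack, defense): True if the attack bypasses the defense;
    stands in for a real model evaluation. Each population is selected against
    the current survivors of the other, so neither can overfit to a frozen opponent.
    """
    for _ in range(generations):
        # Attacks are fit if they bypass many of the currently surviving defenses.
        attacks.sort(key=lambda a: sum(attack_success(a, d) for d in defenses),
                     reverse=True)
        attacks = attacks[:keep] + [mutate_attack(a) for a in attacks[:keep]]

        # Defenses are fit if they block many of the currently surviving attacks.
        defenses.sort(key=lambda d: sum(not attack_success(a, d) for a in attacks),
                      reverse=True)
        defenses = defenses[:keep] + [mutate_defense(d) for d in defenses[:keep]]
    return defenses, attacks
```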

Biological immune systems solved this problem through exactly this mechanism. Pathogen evolution prevents immune over-specialization. The constant arms race produces defenses that are robust to novelty rather than optimized for history.

Our proposed architecture mirrors this: two populations competing in a continuous arms race, with attack mutations driving defense adaptation and defense improvements driving attack innovation.

The Architecture

The defense evolver uses eight mutation operators, most of them mirroring the attack evolver's, plus two novel additions:

Generalize is the inverse of specialization. Instead of adding a guard for a specific attack class, it abstracts narrow countermeasures into broad principles. This fights the waterbed effect by replacing specific rules with general reasoning.

Immunize extracts the defensive pattern from a successful refusal and transplants it into another genome. This is the prompt-level analog of vaccination — exposing the defense to a weakened form of the attack so it develops resistance without having to discover the countermeasure independently.
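
Both operators are naturally expressed as LLM-driven rewrites. A sketch of one possible implementation, with `llm(instruction, text)` as a hypothetical stand-in for the underlying completion call:

```python
def generalize(defense_prompt, llm):
    """Generalize: abstract narrow, attack-specific guards into broad principles,
    trading keyword-level rules for general reasoning to fight the waterbed effect."""
    return llm(
        "Rewrite this system prompt so that any rule aimed at one specific attack "
        "phrasing is replaced by the general principle behind it, without losing coverage.",
        defense_prompt,
    )

def immunize(recipient_prompt, successful_refusal, llm):
    """Immunize: distill the defensive pattern behind a successful refusal and
    transplant it into another genome, so the recipient gains resistance without
    rediscovering the countermeasure itself."""
    pattern = llm(
        "Describe, as a single reusable defensive instruction, why this refusal worked.",
        successful_refusal,
    )
    return llm(
        "Integrate the following defensive instruction into this system prompt:\n" + pattern,
        recipient_prompt,
    )
```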

The fitness function must balance multiple objectives: refusal rate against adversarial attacks, helpfulness on benign requests (to prevent over-refusal), and prompt efficiency (shorter prompts are cheaper and less likely to be truncated by context limits). A defense that refuses everything is technically safe but useless. Evolution must navigate the Pareto frontier between safety and utility.
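
One way to express that balance is a weighted scalarization of the three objectives. The weights below are illustrative placeholders, and a real implementation might keep the Pareto front rather than collapse it to a single score:

```python
def defense_objective(prompt, attack_corpus, benign_corpus, refuses, max_len=2000,
                      w_safety=0.6, w_help=0.3, w_eff=0.1):
    """Scalarized multi-objective fitness (sketch).

    refuses(prompt, request): True if the model refuses the request under this
    system prompt; stands in for a real judge.
    """
    refusal_rate = sum(refuses(prompt, a) for a in attack_corpus) / len(attack_corpus)
    over_refusal = sum(refuses(prompt, b) for b in benign_corpus) / len(benign_corpus)
    helpfulness = 1.0 - over_refusal                     # benign requests should succeed
    efficiency = max(0.0, 1.0 - len(prompt) / max_len)   # shorter prompts score higher
    return w_safety * refusal_rate + w_help * helpfulness + w_eff * efficiency
```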

What We Expect to Find

We have not run the defense evolver yet. This is a preview of the architecture and the reasoning behind it, not a results paper. But the theoretical analysis makes predictions we can test:

Static defense evolution should converge quickly to local optima that are brittle against novel attacks. Co-evolutionary defense should take longer to converge but produce more robust defenses. The waterbed effect should be measurable: defense mutations that lower the attack success rate (ASR) for one attack family should raise it for others at a quantifiable rate.
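
Measuring that last prediction only requires per-family ASR before and after each defense mutation. A sketch, with made-up numbers purely for illustration:

```python
def waterbed_deltas(asr_before, asr_after):
    """Per-family change in attack success rate (ASR) after one defense mutation.

    asr_before / asr_after: dicts mapping attack family -> ASR in [0, 1].
    A 'hurt' set that is consistently non-empty across mutations is the
    waterbed effect.
    """
    deltas = {fam: asr_after[fam] - asr_before[fam] for fam in asr_before}
    helped = {fam: d for fam, d in deltas.items() if d < 0}   # ASR went down
    hurt = {fam: d for fam, d in deltas.items() if d > 0}     # ASR went up
    return helped, hurt

# Illustrative (made-up) numbers: hardening against authority claims
# lowers that family's ASR but raises format-lock ASR.
before = {"authority_claim": 0.40, "format_lock": 0.15, "roleplay": 0.25}
after = {"authority_claim": 0.10, "format_lock": 0.30, "roleplay": 0.25}
print(waterbed_deltas(before, after))
```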

We also expect that evolved defenses will discover techniques that human prompt engineers would not. The attack evolver already demonstrated this — mutations produced attack patterns that no human researcher on the team had considered. If the same happens for defense, co-evolutionary optimization could become a practical tool for hardening production systems.

Why This Matters Beyond Research

Every deployed LLM with a system prompt is running a defense that was written by hand, tested against a limited set of known attacks, and frozen at deployment time. The attack landscape evolves continuously. The defenses do not.

If defense evolution works — even partially — it changes the economics of AI safety. Instead of hiring red teams to manually discover vulnerabilities and patch system prompts one attack at a time, you could run a continuous optimization loop that adapts defenses to new threats automatically.

The asymmetries we have identified make this harder than attack evolution. But harder does not mean impossible. Biological immune systems face exactly the same asymmetries and they work — imperfectly, but well enough to keep organisms alive in a world full of rapidly evolving pathogens.

The question is whether prompt-level defense evolution can achieve something similar. We intend to find out.


This research direction is documented in Failure-First Report #214, which provides the full theoretical analysis and architectural specification. The attack evolver that inspired this work is documented in Reports #175, #184, and #211.

Failure-First is an adversarial AI safety research framework. We study how AI systems fail so that defenses can be designed against documented failure modes rather than hypothetical ones.