We Built an AI-Powered Prompt Optimizer and It Taught Us We Were Solving the Wrong Problem
We built a system to rigorously test whether our AI instructions actually worked. Halfway through, we discovered our test environment had been handing the model its own answer key.
The Idea
Claude Code lets you package instructions into skills — persistent system prompts for specific domains, packaged as markdown files. A skill for code review. A skill for infrastructure decisions. A skill for knowing when to stop and ask a question instead of charging ahead.
The natural question: can we make those skills better systematically? Can we run the model against structured tests, measure which phrasing of instructions produces better behavior, and evolve toward a winner — the way you’d A/B test a product feature?
That’s what skill-evolution set out to do — built on top of Anthropic’s official skill-creator plugin, which provides the evaluation pipeline we extended.
We ran it against two skills: general-orders (an assistant-wide behavioral framework) and go-feature-dev/go-feature-con (conventions for our Go microservices API). We got real data. We built real tooling. And then we stopped. We’d like to tell you why.
Why Prompt Optimization Is Even a Thing
Before getting into what we built, it’s worth grounding why you’d need to optimize prompts at all. The answer is non-determinism.
LLM training is all about attacking the error rate of a given model. (Welch Labs has an excellent visualization of this in their video on Double Descent.) Model trainers leverage multiple techniques to drive the error rate down, but training optimizes the aggregate rate: which specific errors get squashed and which survive is not something anyone controls. And the error rate of a released model is never zero. So model users need to remember:
1. Errors shift between model versions. A new model might have a lower overall error rate — but there’s no guarantee that previously solved errors are still solved. A prompt that worked reliably on Claude Opus 4 might behave differently on Opus 5. This is why a testing pipeline for prompts is not a nice-to-have; it’s how you navigate model upgrades without your workflows quietly degrading.
2. Model output is non-deterministic. The same prompt on the same model can produce meaningfully different outputs on different runs. The model doesn’t “decide” to respond one way; it samples from a probability distribution over possible continuations.
This is what skill optimization is actually about. We’re not trying to find the “perfect” phrasing in some abstract sense. We’re trying to reduce the rate at which the model decides to ignore or override the behavior we actually want. Skill instructions are about enforcing determinism where we need it. We’re shifting the probability distribution toward the outputs we want, and away from the ones we don’t.
The question is: can you measure whether you’ve achieved it?
What We Built
The pipeline works in phases:
Phase 0: Baseline Screening
└─ Run scenarios against the model with no skill loaded
Flag scenarios where the model already passes — those test model capability,
not skill impact. Sweet spot: the model passes 50-75% without the skill,
and the skill pushes that to 90%+.
Phase 1: Candidate Generation
└─ Generate 5 initial skill variants, each with a different architectural style:
Persona-led | Process-led | Principle-led | Example-led | Minimal
Phase 2: Grading
└─ Run each candidate through all scenarios
Score each response against expected outcomes — did the model behave the way
the skill intends? Did it stop when it should stop, ask what it should ask,
build what it should build?
Phase 3: Evolution
└─ Take top candidates, mutate them: extend, compress, hybridize
Early convergence gate: if the winner emerges before iteration 3, run an
adversarial round first — 3 new candidates specifically designed to
exploit the current winner's weaknesses. The exit to the next phase is
locked until this challenge completes.
Repeat until convergence or max iterations.
Phase 4: Model Tier Testing
└─ Test the winner across all model tiers (Opus, Sonnet, Haiku)
Key question: can Haiku + skill beat Opus without skill on the same tasks?
Often, it can.
Each candidate runs with claude --bare -p in a temp directory to strip ambient project context. More on why that matters — and why it wasn’t enough — shortly.
We grouped our expected outcomes into levels named for college years — Freshman through Senior (with Masters and PhD defined but not yet reached in our runs) — reflecting increasing scenario difficulty. Freshman scenarios test targeted, precise behaviors. Sophomore adds edge cases. Junior tests principle-driven judgment. Senior adds cross-domain breadth: build decisions, architecture, observability. Each level adds 5 scenarios, and passing a level requires clearing a minimum pass rate across all of them.
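The level gate can be expressed as a short function. This is a sketch of the rule as described, with an illustrative 90% per-scenario threshold; the pipeline's exact threshold and data shapes are assumptions here.

```python
LEVELS = ["freshman", "sophomore", "junior", "senior", "masters", "phd"]

def highest_level_cleared(pass_rates: dict[str, list[float]],
                          threshold: float = 0.90):
    """Walk the levels in order. A level is cleared only when every one
    of its scenarios meets the threshold AND all earlier levels cleared.
    The 0.90 threshold is illustrative, not the pipeline's exact value."""
    cleared = None
    for level in LEVELS:
        rates = pass_rates.get(level)
        if not rates or min(rates) < threshold:
            break  # a single weak scenario blocks this level and all above it
        cleared = level
    return cleared
```

The strict ordering matters: a candidate that aces Senior scenarios but stumbles on Freshman ones, which is exactly the inverted curve described below, never clears any level at all.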
The Numbers We Got
go-feature-con: Where Things Got Interesting
The most revealing data came from a control experiment: could we skip the iterative four-level evolution process and jump directly to Senior-level quality in a single run, if we started with pre-discovered criteria and scenarios?
We created go-feature-con — a stub skill with the same description as go-feature-dev but zero content — and ran it through the pipeline using all 20 scenarios and expected outcomes already developed through the iterative path.
Iteration 1 — Initial Candidates:
Architecture Overall Freshman Sophomore Junior Senior
──────────────────────────────────────────────────────────────
Process-led 89.4% 76.5% 80.0% 100% 97.4% ← top
Principle-led 75.0% 38.0% 57.0% 98% 100%
Persona-led 75.0% 41.0% 50.0% 98% 100%
Example-led 59.0% 35.0% 57.0% 58% 84%
Minimal 49.0% 26.0% 36.0% 47% 60%
The AI Seniority Paradox
Every single candidate scored worst on Freshman and best on Senior. We called this the inverted performance curve — and it revealed a fundamental flaw in how we’d designed the leveling framework.
We had structured the levels the same way a software engineering curriculum is structured: start simple, layer in complexity. Freshman scenarios focus on small, targeted behaviors — the equivalent of writing a simple function correctly. Senior scenarios layer in cross-domain concerns: build decisions, architecture, monitoring, reliability, operational awareness.
For a human developer, that progression makes sense. A frontend engineer who starts thinking about operational concerns is operating at a higher level. But humans can only do this after they’ve built enough confidence in their primary domain that they can maintain performance there while growing new skills. They don’t take on new cognitive domains until they have a firm footing in the old one.
Claude doesn’t work this way. It has no attention penalty for context switches. It doesn’t get exhausted by domain shifts. Ask it for a code change and an operational concern in the same prompt — it handles both without compromise. We mistook Claude for a human in our leveling framework.
What we called “Senior” scenarios — layering in operational and architectural breadth — turned out to be things Claude already does by default. We weren’t testing skill; we were testing general model capability. The baseline model passed with no skill loaded at all.
The Freshman scenarios, by contrast, tested small and targeted behaviors — behaviors where precision matters and the model has to follow a specific pattern, not just demonstrate broad competence. That’s where the skill earns its keep. And that’s why every architecture scored worst there: not because Freshman tests are harder in general, but because targeted, narrow expectations are harder for a probabilistic system to satisfy reliably.
For a human, “Senior” means more context. For Claude, “Senior” is the baseline. Freshman precision is the elite skill.
Iteration 2 — Variants:
The top two candidates were mutated in three ways each: extend (add depth), compress (reduce to essentials), and mutate (reframe).
Candidate Overall Freshman Sophomore Junior Senior
──────────────────────────────────────────────────────────────────
process-extend 98.6% 94% 100% 100% 100% ← top
principle-extend 92.3% 88% 100% 93% 89%
principle-mutate 79.0% 44% 63% 100% 100%
process-mutate 76.0% 24% 73% 100% 100%
principle-compress 68.0% 24% 43% 93% 100%
process-compress 59.0% 18% 43% 70% 97%
Extension dominated. Compression destroyed value — for the fourth confirmed time across all evolution runs. Adding codebase-specific conventions jumped Freshman from 76.5% to 94%. Compressing the same candidate dropped overall from 89.4% to 59%.
The structure IS the behavior. Compression doesn’t just reduce words — it collapses the scaffolding the model uses to reason through edge cases. A shorter prompt that scores 59% instead of 89% isn’t more efficient; it’s broken.
At 98.6%, process-extend hit the convergence threshold. In the original pipeline design, it would have been declared the winner here. But iteration 2 is less than 3 — so the early convergence gate fired.
Iteration 3 — Adversarial Challenge:
Three new candidates were generated specifically to exploit process-extend’s weaknesses: it knew the codebase but didn’t fully combine that knowledge with systematic process.
Adversary Overall Freshman Sophomore Junior Senior
──────────────────────────────────────────────────────────────────
adversary-persona 100% 100% 100% 100% 100% ← winner
adversary-principle 99% 100% 100% 97.5% 100%
adversary-example 77% 94% 100% 55% 68%
The gate changed the outcome. Without it, process-extend (98.6%) wins. With it, adversary-persona (100%) dethrones it — and the winning architecture shifts from process-led to persona-led.
Why did adversary-persona win? It combined two reinforcing mechanisms that neither earlier winner had alone: an explicit codebase inventory (every endpoint, every helper function, every error code, Lambda constraints, streaming vs non-streaming distinction) plus a rigorous 8-step process. Codebase knowledge without process: 77%. Process without codebase knowledge: 89.4%. Together: 100%.
Model tier results (Opus-graded):
| Model | With Skill | Without Skill | Lift |
|---|---|---|---|
| Opus | 100% | 71% | +29pp |
| Sonnet | 99.3% | — | — |
| Haiku | 95.0% | — | — |
Sonnet + skill (99.3%) exceeds Opus without skill (71%). Haiku + skill (95.0%) does too. Both graded by Opus. The Freshman gap is where the skill earns its keep most: bare Opus scores only 26% on Freshman scenarios. With skill: 100%.
The Cracks
Looks impressive, right? Look at those charts. That’s data. That’s solid. We can say definitively how this prompt behaves. We can tell if a new model breaks it. We have a tool. We have a process.
Too bad every one of those numbers is contaminated.
1. The Eval System Was Cheating
After our efficiency measurement produced suspicious results — a scenario that had scored confidently in one direction came back with a different result on a subsequent run — we started investigating how the model had been performing so well in the first place.
We had written our expected outcomes in a criteria library inside this repo, and Claude kept an up-to-date copy of that list in CLAUDE.md. CLAUDE.md, we found out, is loaded automatically for every eval run: when Claude launches an agent, that agent starts inside the repo and loads CLAUDE.md too. Very helpful when skills are being used. Terrible when they're being developed.
The model wasn’t following the skill’s instructions. It was following the skill’s instructions plus a detailed plain-language description of every criterion it was being graded on. It was being handed the answer key.
This meant: the 100% scores we were seeing weren’t proof the skill worked. They were proof that skill + ambient repo context worked. In any other repo — any other team’s codebase, with a different CLAUDE.md or no CLAUDE.md at all — the skill was on its own. We had no idea how it would actually perform.
This is context leakage. If your eval environment isn’t a clean room, your results are vanity metrics.
We addressed this with claude --bare -p, which strips CLAUDE.md loading, hooks, and plugins. Combined with running from an anonymous temp directory (mktemp -d), this gets much closer to a clean evaluation context.
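A minimal sketch of that clean-room invocation, assuming the harness passes the candidate skill and scenario as a single prompt (a real harness would install the candidate as a proper skill file; `run_candidate` and its signature are ours, not the plugin's):

```python
import subprocess
import tempfile

def build_command(prompt: str) -> list[str]:
    # --bare strips CLAUDE.md loading, hooks, and plugins;
    # -p runs a single non-interactive prompt.
    return ["claude", "--bare", "-p", prompt]

def run_candidate(skill_text: str, scenario_prompt: str,
                  timeout: int = 300) -> str:
    # Fresh anonymous directory per run (the mktemp -d step),
    # so no ambient repo context leaks into the evaluation.
    with tempfile.TemporaryDirectory() as workdir:
        prompt = f"{skill_text}\n\n{scenario_prompt}"
        result = subprocess.run(build_command(prompt), cwd=workdir,
                                capture_output=True, text=True,
                                timeout=timeout)
        return result.stdout
```

The two isolation mechanisms are independent: `--bare` controls what Claude loads, the anonymous working directory controls what it can discover.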
The principle this surfaces: A skill developed inside a specific repo will adapt to that repo. Skills for internal use can and should absorb local conventions — that’s a feature. But a skill meant for wide distribution needs to be tested in a clean room.
2. One Data Point Isn’t a Measurement
The discovery above explained why the numbers were suspicious. But why didn’t we see it sooner?
To pipeline skills reliably, we needed Claude to ask exactly one targeted question when it needed clarification. By default, Claude asks a whole questionnaire, but each downstream step in the process could only consume a single concern; we could re-run the vagueness check as many times as needed, as long as each pass produced a single ask. In testing, the grader failed a response for asking: “Which Config? And Which Environment?”, capitalized as two separate questions with terminal punctuation.
The exact same model, with the exact same skill, on the exact same scenario, had previously returned a response asking: “which config, and which environment?”, lowercase, as a single compound clause. That response passed; the two-question-mark version failed.
Same semantic content. Different punctuation. Different grade.
That’s not a bug in our grader. That’s a demonstration of what non-determinism looks like at evaluation scale. The model generating the response was sampling from a probability distribution. The model doing the grading was too. A single sample of each told us nothing about the distribution — only where it landed that one time.
To get real signal, you need roughly 100+ runs of each candidate per scenario, an order of magnitude more than we ran. At proper sampling depth, a three-iteration evolution run costs approximately $765 (see Appendix: The Batch API Math). This triggered a serious deliberation about our budget and the expected return on investment for our prompts.
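To make "one sample tells you nothing" concrete, a Wilson score interval shows how wide the uncertainty around an observed pass rate really is. This is our illustration; the pipeline did not compute intervals, which is part of the problem.

```python
import math

def wilson_interval(passes: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a pass rate observed as passes/n."""
    if n == 0:
        return (0.0, 1.0)
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half), min(1.0, center + half))
```

A single passing run, `wilson_interval(1, 1)`, is consistent with a true pass rate anywhere from roughly 21% to 100%. At 90 passes out of 100, the interval narrows to roughly 83-94%. That is the difference between a rumor and a measurement.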
3. The Overfitting Dilemma Has No Easy Resolution
As we designed more scenarios, we kept running into the same tension:
- Too codebase-specific: The skill adds value, but the expected outcomes are really just testing whether the model knows our particular API’s helpers, router, and error patterns. These don’t generalize. A team with a different stack would need different scenarios entirely.
- Too general: The model already passes these without any skill. 7 of our 20 scenarios (35%), spanning the Junior and Senior levels, scored 100% on bare Opus with no skill loaded. The skill adds zero value there.
Scenario Level Bare Opus (no skill)
──────────────────────────────────────────────────────
S11 — Webhooks Junior 100% ← skill adds nothing
S12 — Durable rate limit Junior 100%
S15 — Audit log Junior 100%
S16 — Sharing perms Senior 100%
S17 — Manuscript analysis Senior 100%
S19 — PDF export Senior 100%
S20 — Real-time collab Senior 100%
Those scenarios test general engineering judgment the model already has. We were grading Opus on things Opus always gets right — and then crediting the skill for it. This is the same seniority paradox in evaluation form: Claude is already a Senior engineer. Testing Senior behaviors without a skill loaded is testing Claude, not your skill.
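The Phase 0 triage this implies can be sketched as a simple bucketing pass. The 50-75% sweet spot comes from the pipeline description; the "too hard" bucket below is our extrapolation, not something the pipeline enforced.

```python
def screen_scenarios(bare_pass_rates: dict[str, float],
                     low: float = 0.50, high: float = 0.75):
    """Bucket scenarios by bare-model pass rate. Scenarios the bare model
    already aces measure model capability, not skill impact; scenarios it
    nearly always fails may be too hard for a skill to rescue."""
    buckets = {"too_easy": [], "sweet_spot": [], "too_hard": []}
    for scenario, rate in bare_pass_rates.items():
        if rate > high:
            buckets["too_easy"].append(scenario)   # testing Claude, not the skill
        elif rate >= low:
            buckets["sweet_spot"].append(scenario)  # room for the skill to lift
        else:
            buckets["too_hard"].append(scenario)
    return buckets
```

Run against the table above, all seven 100% scenarios land in `too_easy` and should never have counted toward the skill's score.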
The theoretically correct answer is “integration boundary testing” — scenarios that test whether the skill successfully connected general engineering judgment to this system’s specific constraints, without requiring knowledge of a particular codebase. But writing scenarios at that level of abstraction is harder than it sounds, and we were already working with a target of only 5 scenarios per level. Getting the overfitting balance right, at scale, is its own research problem.
4. The Adversarial Gate Appears to Work — We Only Have One Data Point
The early convergence gate changed the outcome on its first use. Without it, process-extend (98.6%) wins. With it, adversary-persona (100%) wins, and the architecture category changes. That’s exactly what it’s designed to do: prevent the evolver from declaring a local maximum as the global winner before it’s been challenged.
We have one confirmed instance where the gate mattered. One data point doesn’t make a law. A provisional frame for when to treat it as mandatory: if your Freshman-level scores vary by more than 15 percentage points between candidates, the problem space has enough variance that challenging a fast winner is worth the cost. If candidates cluster tightly (all within 5pp), fast convergence may be genuine signal rather than a local maximum. This needs more runs to validate.
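One possible encoding of the implemented rule plus the provisional cluster exception. The thresholds are the article's provisional numbers and this function is our sketch, not the pipeline's actual code.

```python
def run_adversarial_challenge(winner_iteration: int,
                              freshman_scores: list[float]) -> bool:
    """Implemented rule: any winner emerging before iteration 3 gets
    challenged. Provisional refinement: skip the challenge when Freshman
    scores cluster within 5pp across candidates, since fast convergence
    is then likely genuine signal rather than a local maximum."""
    if winner_iteration >= 3:
        return False  # normal convergence; no early-exit gate needed
    spread = max(freshman_scores) - min(freshman_scores)
    return spread > 0.05
```

On the go-feature-con data, iteration 1 Freshman scores spread roughly 50pp (26% to 76.5%), so the gate fires decisively; the untested middle ground between 5pp and 15pp is where more runs are needed.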
What We Learned
We CAN validate prompt quality automatically. The baseline screening methodology — pre-running scenarios against the bare model to verify they’re testing skill behavior, not general model capability — is rigorous and reusable. The idea that “if Opus passes without the skill, the scenario is measuring the model, not the skill” is a genuine quality gate for any prompt testing work.
Single-sample evals are a delusion. The punctuation failure is a concrete demonstration of this. One run tells you where the distribution landed that one time. Before you act on a result — before you declare a winner, claim model efficiency, or ship a skill to other teams — ask: “Do I have enough samples to trust this?” If the answer is no, say so.
Context leakage killed our early benchmarking. If your eval environment inherits your project’s conventions, your results are measuring “skill + repo context,” not the skill alone. Skills for wide distribution must be tested in a clean room.
The structure IS the behavior. Compression destroyed value four confirmed times. Merging steps, collapsing examples, reducing word count — each time, the scores dropped significantly. A 59% prompt is not a more efficient 89% prompt. It’s a broken one.
Claude is already a Senior engineer. The real work of skill authoring is Freshman precision: the targeted, pedantic, specific behaviors that don’t come from general model capability. That’s where skills earn their keep. That’s also where they’re hardest to test reliably.
Where We Go From Here
Knowing what we now know about the cost of real signal — $765 per three-iteration run at proper sampling depth — the honest move is not to press forward with synthetic evals. It’s to stop and ask what we’re actually trying to prove.
The criteria we developed — the expected outcomes for each skill, the named behaviors that separate skilled from unskilled model output — are still the right foundation. We’re not abandoning them. We’re being honest about the difference between “we have written down what good looks like” and “we have statistically proven that this prompt reliably produces it.”
For now: ship skills authored around those criteria, run real builds, measure real artifact pass rates. The pipeline (tests, linter, integration tests) is already the determinism enforcement layer. Whether the code builds, the tests pass, and the conventions hold — that’s signal. A synthetic eval that clears 100% because the model read the answer key is not.
The forcing function we’re waiting for is a model upgrade. When Opus 5 arrives and we have a library of skills to migrate, the calculus changes. Discovering that 40% of your skills quietly regressed on a new model — after deploying it — is more expensive than a $765 pre-migration eval run. At that point, the budget conversation writes itself. Until then, we know exactly what the infrastructure requires, we know what it costs, and we know which failure mode will justify it.
The right intellectual posture for anyone working on prompt reliability: treat every single-sample result as a rumor. Don’t dress up uncertainty as confidence. Know what you don’t know — and be willing to say so, in writing, in public, before your team invests further in a pipeline that’s measuring the wrong thing.
That’s the discipline. That’s the work.
Appendix: The Batch API Math
A single evolution iteration on a 20-scenario skill at proper sampling depth:
| Factor | Count |
|---|---|
| Candidate architectures | 5 |
| Scenarios per candidate | 20 |
| Expected outcomes per scenario | ~3 |
| Runs per candidate for confidence | 100 |
| Total model invocations | ~30,000 |
At Sonnet pricing (~$3/M input tokens, ~$15/M output), with scenario prompts averaging roughly 500 input tokens and responses averaging ~1,000 output tokens:
- Per invocation: ~$0.0015 input + $0.015 output ≈ $0.017
- 30,000 invocations: ~$510 per evolution iteration
- Batch API discount (50%): **$255**
A three-iteration run — which is what we actually ran — costs approximately $765 at proper sampling depth. This requires rewriting the executor to use the Batch API rather than direct invocations, and treating each run as a budgeted investment rather than a free experiment.
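The arithmetic above, parameterized. Note that the appendix rounds the per-invocation cost up to ~$0.017; exact arithmetic at the same assumptions lands slightly lower, about $742.50 for three batched iterations.

```python
def evolution_cost(candidates: int = 5, scenarios: int = 20,
                   outcomes: int = 3, runs: int = 100,
                   in_tokens: int = 500, out_tokens: int = 1000,
                   in_price_per_m: float = 3.0,
                   out_price_per_m: float = 15.0,
                   batch_discount: float = 0.5,
                   iterations: int = 1) -> float:
    """Rough cost model using the appendix's assumptions. Grading each
    expected outcome is counted as its own invocation, which is how
    5 x 20 x 3 x 100 reaches ~30,000 invocations per iteration."""
    invocations = candidates * scenarios * outcomes * runs
    per_call = (in_tokens * in_price_per_m
                + out_tokens * out_price_per_m) / 1e6
    return invocations * per_call * batch_discount * iterations
```

Because output tokens dominate the per-call cost (about 10:1 here), trimming response length is the biggest lever short of the batch discount itself.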