The Two Numbers That Predict AI Agent Reliability
A while back, I felt lost writing automated Playwright tests.
I'd turned to Claude Code to help speed up the process, but I kept hitting the same wall. The generated tests were redundant, surface-level, and often useless. It felt like I was spending more time cleaning up after the AI than I would have spent just writing the code myself.
I blamed the model. Maybe relying on AI agents just didn't make sense; maybe this whole thing wasn't going to work out.
Then I realized I was looking at it entirely wrong. AI agents aren't just "generators"; they're complexity-capture mechanisms.
Our world is filled with unstructured data: images, logs, audio, and messy codebases (mine, specifically). AI models are good at digesting this kind of data, but they need the right environment to be useful.
Tools like Claude Code act as a harness. There are multiple kinds of harnesses, but they all do the same job: they give the AI access to the unstructured and structured data that actually matters. You aren't just asking for code. You're giving the agent the "eyes" to see your entire repository, cross-reference gaps, and build solutions that fit the existing architecture.
The two questions that actually determine reliability
We complain that AI is "non-deterministic." That's not wrong, but it's not useful either. The useful framing is that AI reliability is shaped by the environment you build around the model, and that environment answers two distinct questions, not one.
Question one: how good can this ever get? What's the maximum reliability your setup can achieve, no matter how long you iterate? This is your ceiling.
Question two: how fast do you get there? Given that ceiling, what determines whether you reach it in two iterations or twenty (or never)?
These look like the same question. They're not. They're driven by different factors, respond to different kinds of investment, and most teams confuse them, which is why so many AI workflows get stuck at 80% reliability and stay there no matter how much prompt engineering gets thrown at them.
Three things shape your ceiling:
Base model capability for the task class. A model that can't do something one-shot probably can't do it any-shot, no matter how clever your loop.
The harness around it. Tool access, file system visibility, codebase context, packaged conventions like skills. The harness determines what unstructured data the model can actually see and act on.
Task clarity. How much ambiguity does your spec leave on the table? An unclear definition of "done" caps the ceiling regardless of model or harness; there's nothing to converge toward.
Three different things shape your rate of convergence:
The prompt. Your intent, expressed precisely. A tailored prompt turns a random walk into a directed path.
Deterministic guardrails. Schemas, validators, hard boundaries on what counts as a valid output. Guardrails are tracks. They don't determine the destination, but they provide stops and direction.
Verifier quality. Whether your loop can actually tell when the output is correct. A great verifier turns "try again" into a directed search while a weak verifier turns it into noise.
Notice what's not in either list: the iteration count itself. Iteration is the variable you're trying to budget. The six factors above are what determine how that budget pays off.
This is the practical version of "AI is non-deterministic." Yes, individual outputs vary. But the system you build is highly deterministic about what it can achieve and how quickly it converges. Once you separate ceiling-setting from rate-setting, you stop arguing about whether AI is "reliable enough" and start asking the right question: what am I really optimizing for?
The Equation of Inevitability
When you combine all of these elements, something fascinating happens. You can map AI reliability as an equation — and unlike the kind of formula that looks rigorous but falls apart when you test it, this one matches what real agentic systems do.
$$y(v) = C \cdot \left(1 - e^{-k \cdot v}\right)$$
This is a saturation curve. It's the same functional shape as a capacitor charging, a learning curve in cognitive science, or the rate at which a cup of coffee cools. It's also the shape that published agent-loop benchmarks (Reflexion, Self-Consistency, agentic coding pipelines) actually trace when you plot them.
Two parameters carry the whole equation. They're not symmetric: one decomposes cleanly while the other doesn't, and that asymmetry is the most useful thing here.
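To make the shape concrete, here's a minimal sketch in Python. The C and k values are illustrative, not measured from any real system; the only point is that y(v) climbs toward C, never toward 1.0.

```python
import math

C = 0.92  # illustrative ceiling: the best reliability this setup could ever reach
k = 1.1   # illustrative convergence rate per validation iteration

def reliability(v: int) -> float:
    """Expected reliability after v validation iterations: y(v) = C * (1 - e^(-k*v))."""
    return C * (1 - math.exp(-k * v))

for v in range(1, 6):
    print(f"v={v}: {reliability(v):.2f}")
# Roughly 0.61, 0.82, 0.89, 0.91, 0.92 -- the curve saturates at C, not at 1.0
```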
C — the ceiling. How good can this ever get?
$$C = c \cdot m \cdot (1 - q)$$
c = base model capability on this task class
m = the harness (tools, file access, codebase visibility, skills)
q = irreducible task ambiguity (how much the spec leaves underdefined)
Each of these is naturally a fraction of its maximum; they live on [0, 1]. You can rate them qualitatively for any system you're building, and the multiplicative form holds up because all three terms genuinely act as gates: a weak harness chokes the loop the same way an ambiguous spec does, and capability sets the absolute envelope around both.
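As a sketch, here's what scoring a ceiling looks like. The ratings are hypothetical, roughly how I'd have graded my original Playwright setup, not calibrated measurements.

```python
def ceiling(c: float, m: float, q: float) -> float:
    """C = c * m * (1 - q): capability, harness, and (inverse) ambiguity all gate the result."""
    return c * m * (1 - q)

# Hypothetical ratings, all on [0, 1]:
c = 0.9   # the model is strong on this task class
m = 0.8   # decent harness: repo visibility, but no packaged conventions yet
q = 0.3   # the spec still leaves a fair amount of "done" undefined

print(ceiling(c, m, q))  # 0.504 -- roughly a 50% ceiling, no matter how many loops you run
```

Multiplicative gates are punishing: three individually decent ratings compound into a ceiling barely above 50%.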
k — the rate. How fast do we get there?
Unlike C, k doesn't decompose cleanly. I initially assumed it would. Something like k = x · a · ω, where prompt clarity, guardrails, and verifier quality multiply together to produce a rate. It's a tempting shape because each factor seems like it should compound: a great prompt with a great verifier should be more than additively good.
But when I checked it against real benchmarks, the multiplicative form fell apart. Observed k values exceeded what bounded-input multiplication can produce, and the relationships between the inputs weren't even reliably monotonic across configurations. Sometimes a tighter guardrail slowed convergence because it rejected near-misses the loop could have salvaged. Sometimes a stronger verifier mattered more than the prompt by an order of magnitude. There's a real function here, but it isn't x · a · ω.
So:
$$k = f(x, a, \omega)$$
x = prompt clarity
a = deterministic guardrails
ω = verifier quality
The honest description is qualitative: better prompts, tighter guardrails, and stronger verifiers all push k upward, but the way they combine is system-specific and not currently known to follow any clean closed form. Treat k as something you measure, not derive. Once you have it, you can predict any future iteration count for that system.
This isn't a cop-out. It's the same move physicists make for friction coefficients, chemists make for reaction rates, and economists make for elasticities: when the underlying mechanism is real but the closed-form decomposition isn't established, you measure rather than fabricate.
For what it's worth, I really like to keep things as rigorous as possible.
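If you want to see what "measure, not derive" looks like in practice, here's a sketch. The per-iteration reliabilities below are made-up numbers; substitute your own pass rates, and use your observed plateau for C.

```python
import numpy as np

v = np.array([1, 2, 3, 4, 5])                 # validation iterations
y = np.array([0.55, 0.78, 0.85, 0.89, 0.90])  # observed reliability at each v (hypothetical)
C = 0.91                                      # estimated ceiling: where the curve flattens out

# Rearranging y = C * (1 - e^(-k*v)) gives ln(1 - y/C) = -k * v,
# so k falls out of a least-squares line through the origin.
z = np.log(1 - y / C)
k = -np.sum(v * z) / np.sum(v * v)
print(f"measured k ~ {k:.2f}")  # about 0.9 for this made-up data
```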
So v, your number of validation iterations, sits in the exponent, exactly where the original framing wanted it. But the exponent doesn't drive y to 100%. It drives y to C. The ceiling is the real cap, not 1.0, and that turns out to be the most important thing the formula has to teach.
Why this changes how you should think about agent design
Once you accept that C is the cap, three things follow that pure prompting culture gets wrong.
More loops can't beat a low ceiling. If your model isn't capable enough, your harness is thin, or your task is too ambiguous, no amount of validation gets you past C. This explains why teams plateau at 80–90% reliability and stay there no matter how much they tune their loop. The loop is doing its job, pushing you to the ceiling. The ceiling is just lower than you thought.
Capability and validation aren't competing strategies, they pull different levers. A better model raises C (lifts your ceiling). A better verifier raises k (gets you to the ceiling faster). These are different investments with different ROI curves. Stop framing it as "smart model vs. clever loop." Frame it as "do I need a higher ceiling, or do I need to converge faster?"
You can predict when to stop iterating. The math gives you a clean planning rule: you reach about 95% of your achievable ceiling in roughly 3/k iterations. On the Reflexion HumanEval data, that's about 2 to 3 loops. After that, additional iterations are wasted spend. If you want better outcomes, you need to raise the ceiling, not run more loops. This single insight will save you more compute than any prompt-engineering trick.
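Here's that stopping rule as a sketch, reusing the illustrative C and k from earlier; the only real input you need is a measured k.

```python
import math

def iterations_to_95_percent_of_ceiling(k: float) -> float:
    """v* = 3/k, since 1 - e^(-3) is about 0.95."""
    return 3.0 / k

def marginal_gain(C: float, k: float, v: int) -> float:
    """Reliability gained by running iteration v instead of stopping at v - 1."""
    y = lambda x: C * (1 - math.exp(-k * x))
    return y(v) - y(v - 1)

k, C = 1.1, 0.92  # illustrative values; use your measured ones
print(iterations_to_95_percent_of_ceiling(k))  # ~2.7, so budget about 3 loops
print(marginal_gain(C, k, 4))                  # ~0.02 -- the fourth loop barely moves the needle
```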
The conditions that make this formula true
Saturation behavior assumes three things, and if any are missing, the curve breaks down differently:
A working verifier exists. Without one, k collapses toward zero and you get no convergence at all. This is why creative or open-ended tasks don't follow this curve; there's nothing to push against.
Iterations have diversity. Temperature above zero, reflection steps, error-message feedback, or external signals: something has to break deterministic anchoring, or the loop just produces the same failure repeatedly.
The task is within the model's capability band. Below the band, C is near zero and loops can't help. Far above the band, C saturates near one and loops are unnecessary.
When all three hold, which is exactly the regime the article's examples (Playwright generation, Claude Code, validated agentic pipelines) live in, the formula matches reality tightly enough to plan around.
What Karpathy and Claude skills look like in this frame
Karpathy's autoresearch framing is the saturation curve made operational. He's not arguing the agent has to be brilliant. He's arguing the loop has to be rigorous: propose, validate, correct, repeat. In the formula, that's k engineering: drive up verifier quality and iteration diversity until the loop converges fast. The intelligence work happens inside the loop, not inside any single inference.
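As a sketch, that loop has a very simple shape. This isn't any particular tool's API; propose and verify are stand-ins for your own generation call and your own validator (a test runner, a schema check, a linter).

```python
from typing import Callable, Optional, Tuple

def run_agent_loop(
    task: str,
    propose: Callable[[str], str],              # e.g. a model call that drafts a Playwright test
    verify: Callable[[str], Tuple[bool, str]],  # e.g. run the test, return (passed, error output)
    max_iterations: int = 3,                    # budget roughly 3/k once you've measured k
) -> Optional[str]:
    context = task
    for _ in range(max_iterations):
        candidate = propose(context)                               # propose
        ok, errors = verify(candidate)                             # validate
        if ok:
            return candidate                                       # the verifier decides "done", not the model
        context = f"{task}\n\nPrevious attempt failed:\n{errors}"  # correct, then repeat
    return None  # budget spent below the ceiling: go raise C instead of raising v
```

The interesting engineering lives in verify; everything in the k list is ultimately about making that check sharper and its feedback more usable.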
Claude skills land on a different lever. A skill is a packaged harness: pre-loaded conventions, tools, and context injected exactly when a task class calls for them. That's a direct boost to m (more relevant context) and a sharpening of the spec that lowers q (less ambiguity about what "done" means). Both of these raise the ceiling C for a whole class of tasks. You're not making the model smarter. You're making the achievable maximum higher, which then compounds through your validation loop.
This is also why "just write a better prompt" advice scales so poorly. Prompt clarity influences k; it speeds you toward the ceiling, but it can't lift the ceiling. The investments that actually unlock new performance (model upgrades, better harnesses, tighter task definitions, better verifiers) sit in C.
Stop just prompting the AI. Define the architectural constraints, force it to parse your existing standards, set the guardrails, build the verifier, and then let the loop run until you hit the ceiling. After that, stop iterating and go raise the ceiling.
Rinse and repeat.



