When an AI agent finishes writing an acceptance test, there’s a question that hangs in the air: is it right?

The test may run successfully. The browser may click through the expected sequence of screens. The Codeception output may glow with green “OK” lines. But a passing result doesn’t always mean the test is meaningful — or even that it’s testing the right thing.

True validation requires more than a binary pass/fail signal. It demands judgment, context, and understanding. It’s where the art of testing meets the discipline of engineering.

Knowing When a Test Is Correct

In a traditional software project, the developer writes the test, understands the intent behind it, and therefore knows whether a passing result reflects real success. In an AI-generated system, that chain of understanding is more fragile.

A test might pass because it clicked the right buttons in the right order — but it might also pass because it didn’t actually verify anything meaningful. Maybe the validation step was too loose. Maybe the page structure changed, and the test happened to succeed by accident.

So my rule is simple: every test must ultimately make sense to a human.

After the agent completes a test, I manually review it. I begin by looking at the markdown documentation it produces. Each step includes a title, a paragraph, and a screenshot. Together, they form a visual narrative of what the AI did, step by step.
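To give a feel for it, a single step might look something like the sketch below; the heading, wording, and file name are my own placeholders rather than the system's literal output.

```markdown
### Step 3: Open the plugin settings page

The agent opens the plugin's settings page from the admin menu and
confirms that the main options form is visible before continuing.

![Step 3: settings page](screenshots/step-03-settings-page.png)
```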

This documentation isn’t just for show; it’s a remarkably effective review tool. It’s hard for a test that’s wrong in logic to look right in pictures. A screenshot of the wrong setting, or a paragraph that doesn’t match the visible interface, stands out immediately.

When I spot inconsistencies, I regenerate the problematic sections or tweak them manually. Often, these corrections reveal subtle flaws in the agent’s reasoning — places where its internal model of the interface diverged from reality. Fixing these in the agent itself helps the system learn to avoid the same traps next time.

Once the markdown looks correct, I review the test code itself, looking for subtler flaws, such as the overly loose validations mentioned earlier, that would reduce the test's value as a safety net. Again, these improvements are fed back into the agent's code and instructions.
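To make "overly loose" concrete, here is the kind of contrast I look for, written as an illustrative Codeception sketch rather than an excerpt from the generated suite; the page URL, selectors, and labels are placeholders.

```php
<?php
// Illustrative Codeception Cest; the URL, selectors, and labels are
// placeholders, not taken from the real test suite.
class SettingsSaveCest
{
    public function saveSettings(AcceptanceTester $I)
    {
        $I->amOnPage('/admin/my-plugin-settings');
        $I->checkOption('#enable_feature');
        $I->click('Save Changes');

        // Loose check: passes if the word "Saved" appears anywhere on the
        // page, even in an unrelated notice or menu item.
        // $I->see('Saved');

        // Tighter checks: the success notice must show up in the expected
        // element, and the option must actually have been persisted.
        $I->see('Settings saved.', '.notice-success');
        $I->seeCheckboxIsChecked('#enable_feature');
    }
}
```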

Fail Fast, Learn Faster

If there’s one principle that has saved me countless hours, it’s this: fail fast.

Running an AI system like this is expensive — not just in time, but in computation and LLM token usage. Letting the agent continue when something is clearly wrong wastes both.

So I’ve built the system to stop as soon as it encounters an abnormal condition that could compromise the test’s integrity. A missing screenshot, a failed AI response, or an error connecting to the vector database — any of these triggers an immediate halt.
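As a rough sketch of that guard style, and assuming a PHP implementation purely for illustration (the class and method names are placeholders, not the agent's real API):

```php
<?php
// Fail-fast sketch: abort generation the moment an abnormal condition is
// detected, instead of letting the agent limp along with a compromised run.

final class FailFastGuard
{
    /** Throw immediately if a required screenshot was not produced. */
    public static function requireScreenshot(string $path): void
    {
        if (!is_file($path) || filesize($path) === 0) {
            throw new RuntimeException("Missing or empty screenshot: {$path}");
        }
    }

    /** Throw immediately if the AI response is empty or malformed. */
    public static function requireAiResponse(?string $response): void
    {
        if ($response === null || trim($response) === '') {
            throw new RuntimeException('Empty AI response; halting generation.');
        }
    }
}
```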

At first glance, this might seem counterproductive. Wouldn’t it be better to let the agent finish and debug later? But experience says otherwise. The sooner a problem is caught, the clearer its cause. If the agent stops right when the issue appears, the context is still fresh — the logs, reasoning output, and screenshots all point directly to what went wrong.

When that happens, I step in. I review the console output — which includes the AI’s internal reasoning, the UI operations it detected, and the validation results — and diagnose the root cause. Was it a missing instruction in the guidance file? Or an insufficiently robust API wrapper? Or something else?

Once I fix the underlying issue, I resume the generation. Over time, these small interventions accumulate into a kind of training, not in the formal machine-learning sense, but in the practical sense of tuning the system’s behavior. Each correction makes the agent more robust and less likely to stumble in the same way again.

Human-in-the-Loop Validation

Despite the system’s end-to-end automation, it remains a human-AI partnership. The AI handles the grind of generation, iteration, and validation; the human provides judgment, calibration, and context.

My role in this process isn’t to write the tests — that part is automated — but to teach the system how to test well. By reviewing its reasoning, identifying recurring mistakes, and updating its guidance, I act as both supervisor and coach.

The more feedback the system receives, the more reliable it becomes. Its prompts get refined, its heuristics sharpen, and its output improves. Eventually, the number of human interventions needed drops dramatically.

But the human presence never disappears completely — nor should it. There’s always a place for oversight, especially when the stakes involve trust. A test suite is only as valuable as the confidence you can place in it.

Turning a Pass/Fail Signal into Insight

The beauty of this system is that every test run produces not just a result, but a story.

The console logs show the AI’s decisions in real time — what it thought the next goal was, which action it chose, and why. The screenshots show how the environment responded. The Codeception output tells whether the verification succeeded. Together, they form a detailed record of cause and effect.

By reviewing these traces, I can see not just whether the system produced a working test, but how it reasoned its way there. Sometimes I find that a test technically passed but took an inefficient route — too many redundant clicks, or unnecessary waits. Other times, the test failed but for an interesting reason — perhaps exposing a real bug in the plugin, or a hidden fragility in the UI.
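An inefficient route often shows up in the code as fixed sleeps instead of conditional waits, roughly like this (an illustrative fragment from inside a Cest method; the selector and message are placeholders):

```php
<?php
// Inefficient: a fixed pause that always costs five seconds, even when the
// page is ready much sooner.
$I->click('Save Changes');
$I->wait(5);
$I->see('Settings saved.', '.notice-success');

// Better: wait only as long as needed, up to a ten-second ceiling.
$I->click('Save Changes');
$I->waitForElementVisible('.notice-success', 10);
$I->see('Settings saved.', '.notice-success');
```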

In both cases, the insight goes beyond the simple binary of success or failure. The process reveals something about the system itself: how it behaves, how it’s structured, and where it’s brittle.

Building Confidence in the Results

Trust doesn’t come from perfection; it comes from consistency.

The goal isn’t to make the agent infallible, but to make it predictably reliable — to reach a point where its output behaves like a competent tester whose work you can generally trust.

The fail-fast design, the markdown documentation, the detailed logs — these are all scaffolds for that trust. They make the system transparent enough that when something does go wrong, it’s immediately visible and explainable.

Over time, as the agent’s track record grows and the need for intervention drops, confidence builds naturally. You stop thinking of it as an experiment and start seeing it as a dependable member of the development process.