When an AI agent writes a step's worth of test code, that step is only a hypothesis — a theory about how the world should respond. It's not until the code actually runs that we find out whether the hypothesis holds true.
This moment — when code meets reality — is where the learning happens. It’s the heartbeat of the entire system: act, observe, adjust. Each iteration makes the agent just a little more capable of understanding how its environment works and what it takes to succeed within it.
Collecting the Right Signals
Once the agent has generated its next step, that step is appended to the growing test — a method inside a Codeception test class. Then, the full test so far is executed.
Codeception runs the commands in sequence: opening pages, clicking buttons, saving settings, verifying outputs. The framework then returns a simple judgment — pass or fail — along with details of what happened.
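To make this concrete, here is roughly what the growing test might look like as a Codeception Cest class. The page URL, field labels, and messages below are illustrative placeholders rather than details from the actual system; only the final block is the newly generated step whose outcome the agent is trying to judge.

```php
<?php

class SettingsCest
{
    // Each validated sub-goal adds another block of commands to this method.
    public function updateSiteTitle(AcceptanceTester $I)
    {
        // Previously validated steps are replayed on every run.
        $I->amOnPage('/admin/settings');
        $I->fillField('Site title', 'My Test Site');

        // The newly generated step is appended at the end; this is the
        // candidate being evaluated on this iteration.
        $I->click('Save Changes');
        $I->see('Settings saved');
    }
}
```

Re-running the suite (for example with `vendor/bin/codecept run acceptance SettingsCest`) replays every earlier step before reaching the new one, so the candidate is always evaluated from a realistic starting state.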
If the test failed, the output explains why: perhaps a selector didn’t match, or a page element didn’t appear in time. Alongside this output, we also capture the HTML and a screenshot of the browser at the moment the test completed.
Even when the test passes, these artifacts are still collected — because the agent learns just as much from success as from failure. The combination of Codeception output, captured HTML, and screenshot forms a complete record of what happened: a snapshot of the system’s behavior in response to a specific action.
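A harness for this collection phase might look something like the sketch below. Calling Codeception through the shell, the artifact file names, and the output paths are assumptions about how such a wrapper could work, not a description of the article's actual code; it also assumes the test ends with the WebDriver module's `makeScreenshot()` and, if your Codeception version provides it, `makeHtmlSnapshot()`, so that artifacts exist even when the run passes.

```php
<?php

// Minimal sketch of the run-and-collect phase. Exact artifact locations
// depend on the Codeception configuration; tests/_output/debug is the
// default target for named screenshots and HTML snapshots.
function runStepAndCollect(string $testClass): array
{
    $output = [];
    $exitCode = 0;

    // Execute the full test so far; Codeception exits with 0 on success.
    exec("vendor/bin/codecept run acceptance {$testClass} 2>&1", $output, $exitCode);

    return [
        'passed'     => $exitCode === 0,
        'log'        => implode("\n", $output),
        'screenshot' => @file_get_contents('tests/_output/debug/last-step.png'),
        'html'       => @file_get_contents('tests/_output/debug/last-step.html'),
    ];
}
```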
Reasoning About What Changed
Once the test has run, the results are sent to the AI model for reflection. Using the full context — the step that was attempted, the Codeception log, the captured HTML, and the screenshot — the model asks a simple but critical question:
Did this step do what I meant it to do?
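The reflection call might be assembled along these lines. `askModel()` is a hypothetical stand-in for whatever LLM client the system actually uses, and the prompt wording is invented for illustration; the point is simply that the attempted step, the Codeception log, the HTML, and the screenshot all travel together.

```php
<?php

// Hypothetical sketch of the reflection call; askModel() is not a real
// library function, just a placeholder for the model client.
function reflectOnStep(array $run, string $stepCode, string $subGoal): array
{
    $prompt = <<<PROMPT
    Sub-goal: {$subGoal}

    Step that was attempted:
    {$stepCode}

    Codeception output:
    {$run['log']}

    The page HTML and a screenshot from the end of the run are attached.
    Did this step do what it was meant to do? Reply with a verdict
    (valid / retry / give-up), a diagnosis, and, if retrying, a revised step.
    PROMPT;

    return askModel($prompt, [
        'html'       => $run['html'],
        'screenshot' => $run['screenshot'],
    ]);
}
```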
If the answer is yes — if the test passed, and the step produced the intended change — the agent declares it valid. That step becomes part of the permanent test, and it’s also saved to a growing step library for reuse in future tests.
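What gets stored in the step library is not spelled out, but a minimal entry might pair the sub-goal with the validated code so it can be retrieved for similar goals later. The field names and JSON-lines storage below are guesses, not the system's actual schema.

```php
<?php

// Illustrative step-library entry; field names and storage format are assumptions.
$entry = [
    'goal'        => 'Save the settings form and confirm the success notice',
    'code'        => "\$I->click('Save Changes');\n\$I->see('Settings saved');",
    'validatedAt' => date('c'),
];

// Append to a simple JSON-lines file so future runs can look up steps
// by goal similarity.
file_put_contents('step-library.jsonl', json_encode($entry) . PHP_EOL, FILE_APPEND);
```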
But if the test failed, or the action didn’t produce the right result, the agent looks deeper. It examines the information available to it and tries to reason about what failed and why.
Did the element simply take longer to appear? Was the selector incorrect? Did the test reach the wrong page? Or, more fundamentally, did the feature itself not behave as expected?
These questions form the basis of the reasoning loop. The agent interprets the feedback, identifies the likely cause of failure, and adjusts its code to address it, then runs the test again.
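One way to picture that correction step is as a mapping from diagnosis to revision. The fixed diagnosis labels and templated fixes below are a simplification for illustration; in practice the model is free to rewrite the step however it sees fit.

```php
<?php

// Sketch of turning a diagnosis into a revised step. The labels and the
// reflection fields are assumptions, not the system's actual structure.
function applyCorrection(string $stepCode, array $reflection): string
{
    switch ($reflection['diagnosis']) {
        case 'timing':     // element appeared too late: wait before acting
            return "\$I->waitForElement('{$reflection['selector']}', 10);\n" . $stepCode;
        case 'selector':   // wrong locator: swap in one found in the captured HTML
            return str_replace($reflection['badSelector'], $reflection['goodSelector'], $stepCode);
        case 'navigation': // wrong page: prepend a navigation command
            return "\$I->amOnPage('{$reflection['expectedUrl']}');\n" . $stepCode;
        default:           // anything else: take the model's rewritten step as-is
            return $reflection['revisedStep'];
    }
}
```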
Looping Until Success — and Knowing When to Stop
This process continues in a loop: run, evaluate, correct, rerun. Over time, the agent converges toward a working solution — a step that passes and performs the intended action.
If the step succeeds, it’s appended permanently to the test and stored in the library. The agent then moves on to the next sub-goal, repeating the process until the entire high-level test has been constructed.
If the step continues to fail, the agent keeps iterating — but only up to a limit. After a set number of attempts, the system assumes that something deeper is wrong. Perhaps the feature itself doesn’t work as described. Perhaps the original goal was unclear.
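Put together, the outer loop is small. Everything below is a sketch under assumed helper names (`generateStep`, `appendStepToTest`, `runStepAndCollect`, `reflectOnStep`, `saveToStepLibrary`, `removeStepFromTest`, `applyCorrection`), and the attempt limit is an arbitrary example value rather than the system's real cut-off.

```php
<?php

const MAX_ATTEMPTS = 5;  // example value; the real limit is a tuning choice

function buildStep(string $subGoal, string $testClass): ?string
{
    $step = generateStep($subGoal);  // hypothetical: ask the model for a first attempt

    for ($attempt = 1; $attempt <= MAX_ATTEMPTS; $attempt++) {
        appendStepToTest($testClass, $step);           // write the candidate into the Cest file
        $run        = runStepAndCollect($testClass);   // execute the full test so far
        $reflection = reflectOnStep($run, $step, $subGoal);

        if ($run['passed'] && $reflection['verdict'] === 'valid') {
            saveToStepLibrary($subGoal, $step);        // keep it for reuse in future tests
            return $step;                              // the test grows by one known-good step
        }

        removeStepFromTest($testClass, $step);         // roll back to the last passing version
        $step = applyCorrection($step, $reflection);   // revise and try again
    }

    return null;  // retries exhausted: reframe the sub-goal or escalate to a bug report
}
```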
At that point, the agent has several ways to respond. It might decide to generate a new sub-goal, reframing the task to approach it from another angle. Or, if the evidence points to a genuine issue with the feature, it can halt the process entirely and output a bug report — a structured summary of what it observed, what it expected, and why it believes the behavior is incorrect.
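The bug report itself could be as simple as a structured record of the mismatch. The fields and wording below illustrate the observed/expected/reasoning shape; they are not the system's actual format.

```php
<?php

// Illustrative bug report emitted when the agent gives up on a step.
$bugReport = [
    'goal'      => 'Saving the settings form should show a confirmation notice',
    'observed'  => 'Clicking "Save Changes" reloads the page, but the captured HTML '
                 . 'contains no success notice and the screenshot shows none either.',
    'expected'  => 'A "Settings saved" message after submitting the form.',
    'reasoning' => 'The click is registered in the Codeception log, the selectors are '
                 . 'valid, and several distinct step variants all failed the same way.',
    'evidence'  => ['codeception.log', 'final-state.html', 'final-state.png'],
];

file_put_contents('bug-report.json', json_encode($bugReport, JSON_PRETTY_PRINT));
```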
This is an important distinction between this system and something like Voyager. In the Minecraft project, the agent could afford to take wrong turns and continue exploring from there. In testing, wrong turns are expensive. A single invalid step can corrupt the logic of the whole test. So here, the loop is conservative. It only builds on known-good progress, rolling back whenever something doesn’t check out.
Knowing when to stop is as important as knowing how to try.
The loop terminates when success is achieved, when retries are exhausted, or when the agent determines that the task itself is impossible or invalid. In the latter cases, it either regenerates the goal or hands the process back to a human for review.
This isn’t a failing of the system — it’s a safety net. By recognizing the boundaries of its own understanding, the agent prevents one broken assumption from cascading into a stream of wasted attempts.