If intelligence is the engine of this project, then time is its fuel. And early on, I was burning through a lot of it.
Running acceptance tests is slow work. A single test can take minutes to complete, especially on a complex system like WordPress. Add the Divi page builder into the mix, with its layered menus, dynamically loaded panels, and time-consuming saves, and every click or wait command becomes an exercise in patience.
When you’re writing a test manually, that slowness is just part of the job. You run a test, see where it breaks, fix it, and run it again. But when an agent is writing the test — and needs to re-run it after nearly every change — that delay becomes a serious bottleneck.
The irony is that automating test creation doesn’t remove the need for slow acceptance tests; it multiplies it. The agent has a higher error rate than a human and needs to verify its work constantly. Each iteration involves regenerating code, running the test, and checking the results. Early on, this meant a single test could take days. Sometimes, a full week.
Now, through a series of optimizations, the same kind of test that once took a week can be generated in a matter of hours — with little or no human intervention.
Here’s how that happened.
Removing the Real Bottleneck
It quickly became clear that most of the time wasn’t being spent on the agent’s reasoning or database lookups. Vector searches and language model queries, while not instantaneous, accounted for only a few seconds per iteration.
The true bottleneck was the Codeception runs — the full test executions that had to occur after each step. Each run meant spinning up a browser, logging into WordPress, navigating through pages, saving settings, waiting for Divi to load. It was the digital equivalent of watching paint dry — necessary, but painfully slow.
So almost every optimization I’ve made has focused on reducing either the number of Codeception runs or the duration of each.
Consolidating Step Generation
In the earliest versions, testing a single step required up to three separate Codeception runs: one each for the action, wait, and documentation components, and more if any of them needed to be retried. It was thorough, but wildly inefficient.
To fix this, I first introduced a way to evaluate steps from the library before running them. If the vector search of the step library returned an exact match — a step that had already been validated in the past and matched our current needs — that step could be reused immediately. The system would run it once, confirm it still worked, and move on without generating anything new.
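As a rough sketch of that reuse check (in Python rather than the system's actual code, with an illustrative similarity threshold and made-up names like LibraryStep and find_reusable_step), the logic amounts to a nearest-neighbour lookup with a very high cutoff:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

EXACT_MATCH_THRESHOLD = 0.98  # similarity treated as an "exact" match (illustrative value)

@dataclass
class LibraryStep:
    goal: str  # natural-language description of what the step achieves
    code: str  # the Codeception snippet that was validated in a past run

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def find_reusable_step(
    goal: str,
    library: List[Tuple[List[float], LibraryStep]],  # (embedding, step) pairs
    embed: Callable[[str], List[float]],             # embedding function, e.g. an API call
) -> Optional[LibraryStep]:
    """Return a previously validated step only if it matches the current goal near-exactly."""
    query = embed(goal)
    best_score, best_step = 0.0, None
    for vector, step in library:
        score = cosine(query, vector)
        if score > best_score:
            best_score, best_step = score, step
    # Anything below the cutoff falls through to normal step generation.
    return best_step if best_score >= EXACT_MATCH_THRESHOLD else None
```

The high threshold is the point: only a near-identical, already validated step is trusted enough to skip generation entirely.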
Later, I restructured the code to generate the entire step — action, wait, and documentation — in a single pass. That cut the generation time dramatically. Each part was still independently validated, but now they were tested in the same Codeception run, leading to a significant time saving.
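To give a flavour of what "a single pass" means, here is a hedged Python sketch. It assumes the model can be asked to return all three parts at once as JSON; generate_completion is a stand-in for whichever language model client is actually in use.

```python
import json
from typing import Callable, Dict

def generate_step(generate_completion: Callable[[str], str], goal: str) -> Dict[str, str]:
    """Ask the model for the action, wait, and documentation parts in one call."""
    prompt = (
        "Produce a Codeception step for the following goal as JSON with the keys "
        "'action', 'wait' and 'documentation'.\n"
        f"Goal: {goal}"
    )
    step = json.loads(generate_completion(prompt))
    # Each part is still checked independently...
    for key in ("action", "wait", "documentation"):
        if not step.get(key, "").strip():
            raise ValueError(f"Model omitted the '{key}' part of the step")
    # ...but all three are appended together and exercised in a single Codeception run.
    return step
```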
Batch and Roll Back
Even with step reuse, I was still running a test after each retrieved step — and that was still too much. So I built a batching mechanism.
When an exact match was found in the step library, the system would accept it provisionally without testing it, move on to the next goal, and look for a match there too, continuing for as long as the matches kept coming. It would then group the matched steps together, append them all to the test, and only then run the test once on the entire batch. If the batch passed, every step was accepted. If it failed, the system would roll back, halving the sequence and retrying until it found a passing subsequence or had rejected every matched step, at which point it fell back to generating the step from scratch.
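The rollback itself is essentially a halving loop. Here is a minimal Python sketch of that logic, where run_codeception is a hypothetical callback that appends the candidate steps to the test and reports whether the full run passes:

```python
from typing import Callable, List

def accept_batch(batch: List[str], run_codeception: Callable[[List[str]], bool]) -> List[str]:
    """Try the whole batch of matched steps; on failure, halve and retry.

    Returns the accepted steps. An empty result means every matched step was
    rejected and the caller should fall back to generating the step from scratch.
    """
    candidate = batch
    while candidate:
        if run_codeception(candidate):  # one full test run over the whole candidate sequence
            return candidate
        candidate = candidate[: len(candidate) // 2]  # roll back: keep only the first half
    return []
```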
This gave me the benefits of step composition (fewer runs, faster progress) without the rigidity or storage costs of pre-composed sequences. It let the agent move quickly when the terrain was familiar, and carefully when it wasn’t.
Caching Test Progress
Another big breakthrough came from tackling the repetition inside Codeception itself.
Each time a new step was added to the test, the system had to re-run all the previous, already validated steps to reach the new point. For long runs, this meant hours spent retreading old ground — logging in, opening the same settings, clicking through the same menus, over and over again.
So I built a checkpoint system, a kind of caching layer around Codeception. Whenever the test reached a new page load — a natural, stable boundary in the workflow — the script captured the entire test state: the WordPress database, uploads folder, browser cookies, and the current URL.
When the test was rerun, it could restore this state and resume from the last checkpoint instead of starting at the beginning.
There are limits: I can’t yet capture the state of a partially interacted page (for example, after typing into a form but before submitting it). But new page loads — switching from the Divi Builder to the front end, for example — worked perfectly. Furthermore, by adding a JavaScript variable to the page, the system could identify fresh page loads on its own, which meant checkpointing could be applied automatically.
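Pieced together from the description above, a checkpoint layer might look roughly like the following Python sketch. Everything here is an assumption about the mechanics rather than the real implementation: the wp-cli database export, the uploads path, the Selenium-style driver object, and the __freshPageLoad flag are all illustrative stand-ins.

```python
import json
import shutil
import subprocess
from pathlib import Path

UPLOADS_DIR = Path("/var/www/html/wp-content/uploads")  # assumed WordPress install location

def is_fresh_page_load(driver) -> bool:
    # The injected flag is set by a script on each new page load and cleared once
    # a checkpoint is taken, so a truthy value marks a load not yet checkpointed.
    return bool(driver.execute_script("return window.__freshPageLoad === true;"))

def save_checkpoint(driver, checkpoint_dir: Path) -> None:
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    # 1. WordPress database dump (assumes wp-cli is available on the test host).
    subprocess.run(["wp", "db", "export", str(checkpoint_dir / "db.sql")], check=True)
    # 2. Uploads folder, copied wholesale.
    shutil.copytree(UPLOADS_DIR, checkpoint_dir / "uploads", dirs_exist_ok=True)
    # 3. Browser cookies and the current URL.
    state = {"cookies": driver.get_cookies(), "url": driver.current_url}
    (checkpoint_dir / "browser.json").write_text(json.dumps(state))

def restore_checkpoint(driver, checkpoint_dir: Path) -> None:
    subprocess.run(["wp", "db", "import", str(checkpoint_dir / "db.sql")], check=True)
    shutil.copytree(checkpoint_dir / "uploads", UPLOADS_DIR, dirs_exist_ok=True)
    state = json.loads((checkpoint_dir / "browser.json").read_text())
    driver.get(state["url"])       # visit the domain first so the cookies can be attached
    for cookie in state["cookies"]:
        driver.add_cookie(cookie)
    driver.get(state["url"])       # reload with the restored session
```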
The result was transformative. Whole sections of the test could now be skipped on re-runs, cutting the total runtime down significantly.
Parallelization
Parallelization is the next obvious frontier. In theory, the system is almost trivially parallelizable: two machines can generate two tests simultaneously, each working independently.
The main challenge lies in sharing the vector database, so that both can learn from each other’s progress. But that’s a solvable problem. While it wouldn’t speed up the generation of any individual test, parallelization would allow the throughput of test generation to be increased almost without limit.