At some stage in software development, you discover the real challenge isn't writing new code; it's coping with deployments and worrying about whether what used to work still does.

The software industry’s solution is automated acceptance testing. These tests are scripts that act like real users: they open a site or application, click buttons, change settings, and check that everything behaves as it should. In my WordPress plugin business, Divi Booster, I use browser-controlling tests written in a framework called Codeception, and they’re invaluable.
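To make that concrete, here is a minimal sketch of a Codeception acceptance test in the Cest format, assuming a standard Codeception acceptance suite. The credentials, selectors, and the text it checks for are placeholders, not values from a real site:

<?php

class LoginCest
{
    public function logIntoWordPressAdmin(AcceptanceTester $I)
    {
        // Drive a real browser through the standard WordPress login form
        $I->amOnPage('/wp-login.php');
        $I->fillField('#user_login', 'admin');
        $I->fillField('#user_pass', 'password');
        $I->click('#wp-submit');

        // Confirm we landed on the admin dashboard
        $I->see('Dashboard');
    }
}

Each line is an instruction to the browser: visit a page, fill a field, click a button. The see() call at the end is the assertion that turns the script into a test.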

Every time I update a plugin or migrate to a new version of WordPress or Divi (the WordPress theme my plugins extend), these tests act as a safety net. They confirm that the things I’ve built still do what they’re supposed to do. Without them, releasing would be a leap of faith.

But here’s the problem: writing these tests is slow, tedious work. Each one has to capture the exact sequence of actions, waits, selectors, and validations that mirror real user behavior. Even a small tweak in the user interface can break a test and send you back to the editor. It is a chore — the kind of chore that, in theory, an AI should be able to help with.

Unfortunately, when you ask your favorite large language model to “write an acceptance test for such and such…”, you often get something that looks right but doesn’t work. The structure might be correct, the syntax neat, but when you actually run it, it fails in the first few steps. The AI can work out how the test should look, but it doesn’t know how the system behaves. It can’t see what’s happening in the browser.

I decided to take a different approach — to build an AI that could learn to write acceptance tests the same way humans do: by figuring it out, step by step.

From Playing Games to Testing Code

The spark for this idea came from an unlikely place: Minecraft.

In 2023, a paper called Voyager: An Open-Ended Embodied Agent with Large Language Models was published. In it, an AI agent learned to play Minecraft not by following prewritten instructions, but by pursuing a high-level goal to explore and discover.

The agent could “see” the world around it — its inventory, its surroundings — and from that, decide on a small next goal: chop a tree, build a shelter, mine for stone. It would then write a short piece of code to perform that action, run the code, and check what happened. If the code worked, it stored it in a growing library of “skills” for later use. If it failed, it refined and retried.

Through this cycle — sense, act, reflect, repeat — the AI gradually became more capable. It wasn’t told how to play the game. It learned how to play the game.

That pattern — breaking a big goal into small, verifiable steps, testing each one against a live environment, and storing what works — is what inspired me.

Turning Exploration into Execution

Software testing may not have quite the same appeal as AI-driven Minecraft, but it can follow the same logic.

Imagine an AI agent with a clear, concrete objective:

“Open the WordPress admin, apply a plugin setting, save it, then check that the change appears correctly on the front end.”

That’s the kind of task an acceptance test would perform. But instead of writing the whole test in one go, the agent could approach it incrementally. It might begin by figuring out how to log in, then how to navigate to the right settings page, then how to interact with the required setting. As it goes, it builds up an acceptance test that records its progress.
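Here's a sketch of the kind of test the agent might end up with for that objective, assuming a hypothetical plugin settings page. The option names, selectors, URLs, and messages are placeholders, and each commented step corresponds to one increment the agent figured out and verified before moving on:

<?php

class PluginSettingCest
{
    public function settingAppearsOnFrontEnd(AcceptanceTester $I)
    {
        // Step 1: log in to the WordPress admin (worked out and verified first)
        $I->amOnPage('/wp-login.php');
        $I->fillField('#user_login', 'admin');
        $I->fillField('#user_pass', 'password');
        $I->click('#wp-submit');
        $I->see('Dashboard');

        // Step 2: navigate to the plugin's settings page
        $I->amOnPage('/wp-admin/admin.php?page=example-plugin');

        // Step 3: apply the setting and save it
        $I->checkOption('#example-setting');
        $I->click('Save Changes');
        $I->see('Settings saved');

        // Step 4: confirm the change appears on the front end
        $I->amOnPage('/');
        $I->seeElement('.example-setting-effect');
    }
}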

At each stage, it can run the test so far, observe what happened (using the pass/fail output and even screenshots or HTML snapshots), reason about the result, and decide on the next action. If a step succeeds, it’s added to the growing script. If it fails, it’s adjusted and retried.
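In code, that outer loop might look roughly like the following sketch. The proposeNextStep(), appendStep(), removeStep(), and goalReached() helpers are hypothetical stand-ins for calls to a language model plus some file handling; they are not part of Codeception or any real library.

<?php

// Hypothetical sketch of the agent's build-and-verify loop.
$testFile = 'tests/acceptance/PluginSettingCest.php';
$feedback = '';

while (!goalReached($testFile)) {
    // Ask the model for one small next action, given the test so far
    // and any failure output from the previous attempt
    $step = proposeNextStep($testFile, $feedback);
    appendStep($testFile, $step);

    // Run the partial test against a real browser and capture the outcome
    $output = [];
    exec('vendor/bin/codecept run acceptance PluginSettingCest', $output, $exitCode);

    if ($exitCode === 0) {
        $feedback = '';                      // step verified: keep it and move on
    } else {
        removeStep($testFile, $step);        // discard the failing step
        $feedback = implode("\n", $output);  // feed the failure output back for reflection
    }
}

Each pass through the loop is one turn of the sense, act, reflect, repeat cycle described earlier.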

The result is a test that doesn’t just look correct — it works in practice. It’s been verified line by line against a real browser, in a real environment.

And while the process can be slow — because the AI agent must run each partial test and observe the result — that diligence is precisely what gives it accuracy. It’s a methodical, experiential way of building knowledge.

The process also produces a natural form of documentation: every working step becomes a record of how a feature is used.

About This Book

This is a book about that approach and how I implemented it to generate acceptance tests for my WordPress plugin business.

In the chapters that follow, I’ll unpack this project in detail — from how the agent understands its environment and reasons about its state, to the challenges of keeping it focused and efficient.

We’ll also look at the wider implications. Automated acceptance tests can serve as more than quality assurance tools; they can become documentation, feature design tools, even stepping stones toward self-healing software.

Finally, we'll look at related projects, such as Tencent's XUAT-Copilot, that are exploring similar territory and offer further evidence that an AI agent can learn to test software the way Voyager learned to play Minecraft: not by being told what to do, but by discovering how to do it.