What made Voyager remarkable wasn’t that it played Minecraft — it’s that it learned through interaction. It explored an environment, noticed what worked, and built on its successes.
That same principle applies surprisingly well to software testing. A WordPress site might seem far removed from a virtual landscape of trees and caves, but from an AI agent’s perspective, both are interactive worlds governed by cause and effect. The difference is that instead of crafting tables and iron pickaxes, we’re dealing with buttons, menus, and settings panels.
The trick is learning to see a testing environment the same way Voyager saw Minecraft: as a living, explorable world where every action changes the state — and every success or failure is feedback.
Mapping Minecraft to WordPress
In Voyager, the agent’s world was Minecraft itself. In our case, the “world” is a WordPress site. The agent’s field of vision is the browser, and its means of interaction is a testing framework that drives Chrome the way a user would. I use Codeception, but other options are available.
From a technical standpoint, Codeception acts as a front end to a tool called ChromeDriver, which lets code send commands to the browser: “click this,” “type that,” “wait for this element.” The agent doesn’t just imagine what would happen; it sees the real result in the DOM and in the browser’s rendering of it.
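In practice, those commands are ordinary Codeception steps. A few representative examples (the selectors and field labels here are made up, purely to show the shape of the API):

$I->click('#submit');                          // click an element by CSS selector, link text, or button name
$I->fillField('Site Title', 'My Test Site');   // type into a field located by label, name, or selector
$I->waitForElement('.notice-success', 10);     // wait up to 10 seconds for an element to appear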
That’s our environment. It’s dynamic, unpredictable, and sometimes surprising — exactly like a Minecraft world.
Defining “State,” “Goal,” and “Action”
Let’s make this mapping concrete.
State
In Minecraft, the state is what the agent currently perceives: its inventory, its surroundings, the time of day, and so on. In acceptance testing, state is everything the agent can observe in the browser.
That includes the page’s HTML structure, visible elements, console messages, and even a screenshot of what’s on screen. It’s the combination of what the agent has done and what the system looks like now.
Every time a step in the test runs, the system state updates — maybe a button is now disabled, maybe a new message has appeared. The agent uses that updated state to decide what to do next.
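Codeception exposes most of that state directly, so a minimal sketch of how the agent might capture it after a step runs could look like this (the screenshot name is arbitrary):

$html     = $I->grabPageSource();              // the page's current HTML structure
$headings = $I->grabMultiple('h1, h2');        // text of visible elements of interest
$I->makeScreenshot('after-last-step');         // an image of what's actually on screen right now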
Goal
Voyager’s goals were open-ended: to explore and develop new skills. In testing, our goals are narrower but more sharply defined. A high-level goal might be:
“Verify that enabling the plugin’s background blur option sets a visible blur on the background when viewed on the front end.”
That single goal can be decomposed into smaller sub-goals:
- Log in to the admin panel.
- Open the plugin settings page.
- Enable the background blur option.
- Save the settings.
- Visit the site’s front end.
- Check that the background blur appears as expected.
Each of these sub-goals is small enough for the agent to attempt, observe, and verify before moving to the next.
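Strung together, those sub-goals map almost one-to-one onto Codeception steps. Here is a sketch of what the finished test could look like; it assumes the wp-browser helpers loginAsAdmin() and amOnAdminPage() are available, and the settings page slug, option label, and front-end CSS class are hypothetical placeholders for the plugin under test:

class BackgroundBlurCest
{
    public function blurIsVisibleOnFrontEnd(AcceptanceTester $I)
    {
        $I->loginAsAdmin();                                        // log in to the admin panel
        $I->amOnAdminPage('options-general.php?page=my-plugin');   // open the plugin settings page
        $I->checkOption('Background blur');                        // enable the background blur option
        $I->click('Save Changes');                                 // save the settings
        $I->amOnPage('/');                                         // visit the site's front end
        $I->seeElement('.site-background.has-blur');               // check that the blur appears as expected
    }
}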
Action
In Minecraft, an action might be “move forward,” “craft tool,” or “attack zombie.” In our world, it’s a line of PHP test code.
Something like:
$I->click('Publish');
Each action changes the browser’s state, and the agent can check whether the change aligns with the intended outcome.
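A single move and its immediate feedback might read like this; the success notice text is an assumption about the page under test:

$I->click('Publish');                        // the move
$I->waitForText('Post published.', 10);      // the feedback: did the state change as intended?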
This is where the generative part comes in: instead of writing all these steps by hand, we let the agent propose them, test them, and refine them.
Each Click Is a Move, Each Result Is Feedback
Once you frame it this way, the parallels become obvious. Every click or input is a move. Every resulting page load or error message is feedback.
The agent’s job is to discover moves that lead it closer to fulfilling the test’s objective. When a move works — the button clicks, the setting saves, the expected element appears — that success becomes part of a growing test script.
When it fails, the difference is instructive. Maybe the selector didn’t match; maybe the page took longer to load; maybe the wrong button was pressed. The agent can analyze that failure, adjust the code, and retry — but crucially, it never builds on a broken foundation.
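Often the repair is simply a more patient or more precise version of the same step. For example (both selectors are illustrative, not taken from a real run):

// First attempt failed: the selector matched nothing because the page was still loading.
// $I->click('#publish');
// Repaired attempt: wait until the button is clickable, then target it explicitly.
$I->waitForElementClickable('.editor-post-publish-button', 10);
$I->click('.editor-post-publish-button');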
Unlike Voyager, which could wander into failure and continue from there, our agent can’t afford wrong turns. A bad test step breaks the test and may corrupt the site’s state as well. So when something fails, it’s either repaired or discarded. Progress resumes from the last known-good point — the most recent partial test that runs successfully.
Over time, this process yields not just fragments of test logic, but full, reliable acceptance tests — working artifacts that can be rerun, reused, and extended.
You could think of it as learning by accretion: the test grows by adding only verified steps, layer by layer, until the full test is complete.
The Acceptance Test as the Final Artifact
Voyager’s skill library captured how to do things — snippets of code for cutting wood or crafting tools — but it didn’t explicitly record an entire “playthrough.” In testing, we want exactly that: the finished sequence of validated steps that a human could run at any time to reproduce the result.
That sequence is the acceptance test. It’s not just a proof that the agent understood what to do; it’s a permanent artifact of that understanding — a reproducible, machine-verifiable record of how to achieve a goal in the system.
And just like Voyager’s skills, these test steps can be stored, searched, and reused. Each one becomes a reference for subsequent generation runs — a memory the agent can consult when facing similar challenges in the future.
From Exploration to Achievement
In the end, what we’re really doing here is adapting a system designed for open-ended exploration into one suited for directed achievement.
Where Voyager wandered and discovered, our testing agent advances deliberately and records what works.
Where the Minecraft agent could afford a few failed experiments, ours must prune them away.
And where the game’s agent learned to survive, ours learns to verify.
Yet beneath those differences, the structure is the same: a feedback loop of observation, action, and reflection — a process of building understanding by interacting with reality.
In the chapters that follow, we’ll step deeper into this world. We’ll look at how our agent perceives its test environment, how it decides what to do next, and how it stores its growing collection of reliable, reusable test fragments.