In 2023, a group of researchers released a paper called “Voyager: An Open-Ended Embodied Agent with Large Language Models” that, on the surface, was about an AI playing Minecraft. That might sound trivial, perhaps just another entertaining application of AI to an in-game task, but it wasn’t. Contained in that work was a general method for learning through interaction. And it’s that method, not the application to Minecraft itself, that would inspire my work on teaching an AI to build acceptance tests for my company, Divi Booster.

The researchers set the Voyager agent loose inside Minecraft’s vast, open world with very human-sounding instructions: go explore and get better at surviving. They didn’t hand it a fixed list of missions or the rules for winning. Instead, they gave it a way to perceive its environment, to decide what to try next, to write bits of code to perform those actions, and to learn from the results.

What emerged was less like a scripted bot and more like a curious apprentice. It didn’t just play Minecraft; it learned Minecraft.

From Goals to Sub-Goals

Voyager’s first challenge was figuring out what to do. Minecraft is a sandbox game — there are trees, caves, oceans, monsters, but no predefined “next step.” The agent had to decide for itself what progress looked like.

The researchers built a simple mechanism for this around a large language model (LLM); the paper calls it an automatic curriculum. The agent would look at its current situation (perhaps it had no tools but was standing near trees) and generate a sensible short-term goal: collect some wood. When that was achieved, it would propose the next one: craft a crafting table, build a shelter, mine some stone.

Each sub-goal was small enough to attempt and verify, but together they formed a ladder of growing capability. This is one of the key insights: the AI wasn’t programmed with a plan; it built its own plan by linking achievable goals.
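
To make this concrete, here’s a minimal sketch of what such a goal-proposing step could look like. It isn’t Voyager’s actual prompt or code (the real system is more elaborate, and written in JavaScript against the game); llm_complete is a hypothetical stand-in for whatever chat-completion API you’d use:

```python
import json

def llm_complete(prompt: str) -> str:
    """Hypothetical stand-in for a real chat-completion API call."""
    raise NotImplementedError("wire this up to your LLM provider")

def propose_next_goal(state: dict, completed: list[str]) -> str:
    """Ask the model for one small, verifiable next goal, given the agent's state."""
    prompt = (
        "You are guiding an agent in Minecraft.\n"
        f"Inventory: {json.dumps(state['inventory'])}\n"
        f"Surroundings: {json.dumps(state['surroundings'])}\n"
        f"Goals already achieved: {completed}\n"
        "Propose ONE short, checkable goal that builds on what the agent can already do."
    )
    return llm_complete(prompt).strip()
```

With an empty inventory and trees nearby, a reasonable reply would be something like “collect three wood logs”.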

Seeing and Acting

To make progress, the agent had to see its world. It could read its inventory (what items it had) and its surroundings (where resources or dangers were). This information was fed into the LLM, which would then generate a short piece of code for achieving the current goal.

For example, the LLM, drawing on its built-in knowledge of Minecraft, might write a few lines of game-controlling code (in Voyager’s case, JavaScript calls to the Mineflayer API): walk to the nearest tree, swing an axe, collect logs. That code was then executed in the game, and the results observed.

If it failed, the language model analyzed what happened, adjusted the code, and tried again. The loop continued until success.
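
In code, that retry loop might look something like the sketch below. The two helpers are hypothetical placeholders, not anything from the paper: llm_write_code asks the model to draft (or revise) action code, and run_in_game executes it and reports back:

```python
def llm_write_code(goal: str, state: dict, feedback: str) -> str:
    """Hypothetical: ask the model for action code toward `goal`, given any prior failure."""
    raise NotImplementedError

def run_in_game(code: str) -> tuple[bool, str]:
    """Hypothetical: execute `code` in the game; return (success, error log)."""
    raise NotImplementedError

def achieve(goal: str, state: dict, max_attempts: int = 4) -> str | None:
    """Generate, execute, and refine code until the goal is met or attempts run out."""
    feedback = ""
    for _ in range(max_attempts):
        code = llm_write_code(goal, state, feedback)  # draft or revise the action code
        ok, log = run_in_game(code)                   # try it in the environment
        if ok:
            return code                               # verified, working code
        feedback = log                                # failure details guide the retry
    return None                                       # give up after too many attempts
```

The key design choice is that the environment, not the model, judges success: the code either works in the game or it doesn’t.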

Building Capability Through Reuse

When a piece of code succeeded, the agent recorded both the goal and the working code in a growing library of skills.

Over time, the library filled with small, reliable programs: chop wood, build a crafting table, mine iron. Each became a reusable building block for future actions. This was the AI’s memory: a way to accumulate experience.
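
A skill library can be as simple as a map from goal descriptions to verified code. (Voyager actually indexes skills by embedding similarity so they can be retrieved by meaning, but a plain dictionary shows the idea.) A minimal sketch:

```python
class SkillLibrary:
    """A toy skill library: goal description -> code that has already worked."""

    def __init__(self) -> None:
        self._skills: dict[str, str] = {}

    def add(self, goal: str, code: str) -> None:
        self._skills[goal] = code          # only store code verified in the game

    def lookup(self, goal: str) -> str | None:
        return self._skills.get(goal)      # reuse instead of rediscovering

library = SkillLibrary()
library.add("chop wood", "...working tree-chopping code goes here...")
```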

The beauty of this approach lies in its reuse. Once the agent learned how to chop wood, it never had to rediscover that knowledge. When it later needed to build a house, it could simply call on the chop wood skill as part of a larger plan.

The agent wasn’t just completing individual tasks; it was becoming more capable. Because it could combine existing skills, it could now attempt more ambitious goals — mining rare minerals, crafting tools, exploring deep caves — without starting from scratch each time.
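
To see what that composition might look like, here is a sketch reusing the hypothetical SkillLibrary, achieve, and run_in_game helpers from the earlier sketches. The build_house plan and its steps are invented for illustration:

```python
def build_house(state: dict, library: SkillLibrary) -> None:
    """Pursue a larger goal by composing smaller, already-verified skills."""
    plan = ["chop wood", "craft planks", "build a crafting table", "assemble walls"]
    for step in plan:
        code = library.lookup(step)        # reuse a stored skill if we have one
        if code is None:
            code = achieve(step, state)    # otherwise, learn it the slow way
            if code is not None:
                library.add(step, code)    # and remember it for next time
        if code is not None:
            run_in_game(code)
```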

By the end of its run, Voyager performed far better than earlier approaches: it collected several times more unique items and climbed Minecraft’s tech tree many times faster than prior agents. It could adapt to new worlds, solve new challenges, and expand its skillset continually. In essence, it had evolved from a rule-following automaton into a problem-solving system.

The Bigger Lesson

For me, the significance of Voyager has nothing to do with Minecraft. It’s the pattern: observe, act, evaluate, remember, repeat.

The AI wasn’t trying to predict the perfect sequence of actions in one shot; it was learning by doing, verifying each step, and storing what worked. It could reason about its state, generate new code, and refine it through feedback from the environment. The outcome was code that didn’t just look right — it worked in reality.
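
Stitched together from the hypothetical helpers above, the whole pattern fits in a few lines. This is my own distillation of the loop, not the paper’s implementation:

```python
def observe() -> dict:
    """Hypothetical: read the agent's inventory and surroundings from the game."""
    raise NotImplementedError

def explore(steps: int = 100) -> SkillLibrary:
    """Observe, act, evaluate, remember, repeat."""
    library, done = SkillLibrary(), []
    for _ in range(steps):
        state = observe()                                    # observe
        goal = propose_next_goal(state, done)                # decide what progress means now
        code = library.lookup(goal) or achieve(goal, state)  # act, reusing skills where possible
        if code is not None:                                 # evaluation happened inside achieve()
            library.add(goal, code)                          # remember what worked
            done.append(goal)                                # and climb the ladder
    return library
```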

That, in a nutshell, is the same idea I wanted to bring into software testing.

If an AI can learn to master Minecraft by iteratively sensing, acting, and improving, perhaps it can also learn to master the art of constructing a reliable acceptance test — one step, one success, one stored skill at a time.