It was around the time my own system first began to work — when the AI agent could reliably generate, run, and refine acceptance tests — that I paused to see who else might be exploring similar territory. The few resources I found mostly offered prompts for getting ChatGPT to generate a complete test. As discussed earlier, such tests, generated without detailed knowledge of the target system, are likely to be fragile and to require significant manual effort to turn them into working tests. But buried among the scattered papers was one that immediately stood out: a study from Tencent titled XUAT-Copilot: Multi-Agent Collaborative System for Automated User Acceptance Testing with Large Language Model.

The paper described something strikingly familiar. Tencent’s engineers were building an AI system that could generate, execute, and adapt tests for their own software — not websites or WordPress plugins, but complex mobile applications used by millions of people every day. Different environment, different terminology, same underlying idea.

Here, finally, was independent confirmation that this approach — combining large language models with iterative testing in a live environment — wasn’t just a niche experiment. It was a pattern beginning to emerge in multiple places at once.

The XUAT-Copilot System

Tencent’s system, which they called XUAT-Copilot, aimed to automate user interface testing across the vast and ever-changing landscape of their mobile apps. Like my own system, it was built around a loop: generate, run, observe, and refine.

At its core was a large language model that could translate high-level human instructions — something like “test that the profile page opens correctly” — into executable steps. These steps were then run directly against the app’s interface, using a combination of automated input controls and screen recognition to simulate real user interactions.
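To make that concrete, here is a minimal sketch of the translation step in Python. It is not Tencent's code or mine; plan_steps, ask_llm, and the prompt wording are all hypothetical stand-ins for much richer machinery.

    # Hypothetical sketch: a language model turns a plain-English instruction
    # into structured, executable UI steps. Names and prompt are illustrative.
    import json
    from typing import Callable

    def plan_steps(instruction: str, ask_llm: Callable[[str], str]) -> list[dict]:
        """Ask the model for a JSON list of steps, each an action plus a target."""
        prompt = (
            "Translate this testing instruction into a JSON array of UI steps, "
            "each an object with 'action' and 'target' keys. Reply with JSON only.\n"
            f"Instruction: {instruction}"
        )
        return json.loads(ask_llm(prompt))

    # plan_steps("test that the profile page opens correctly", my_llm_client)
    # might return: [{"action": "tap", "target": "profile_icon"},
    #                {"action": "assert_visible", "target": "profile_header"}]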

Each time a test ran, the system captured evidence of what happened: logs, screenshots, and a record of every interaction. When a step failed, the model analyzed the output, diagnosed the cause, and regenerated the code — adjusting commands, timing, or logic until the test succeeded.
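In spirit, that refine-on-failure cycle reduces to a loop like the one below. Again, this is a sketch under assumptions: run_step, diagnose, and regenerate are hypothetical placeholders for a device or browser driver, a diagnosis prompt, and a rewrite prompt.

    # Hedged sketch of the generate-run-diagnose-regenerate loop; every helper
    # here stands in for a larger subsystem in the real systems.
    MAX_ATTEMPTS = 3

    def stabilize_step(step, run_step, diagnose, regenerate):
        """Run a step; on failure, have the model diagnose and rewrite it, then retry."""
        for attempt in range(MAX_ATTEMPTS):
            result = run_step(step)          # returns logs, screenshots, pass/fail
            if result["passed"]:
                return step, result          # keep the working version and its evidence
            cause = diagnose(step, result)   # model reads the captured output, names the fault
            step = regenerate(step, cause)   # model adjusts commands, timing, or logic
        raise RuntimeError(f"step still failing after {MAX_ATTEMPTS} attempts")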

Over time, this created a self-improving process. The AI didn’t just execute tests; it learned how to test better. Each success was stored and reused, gradually building a corpus of reliable testing patterns.

Sound familiar? It should. The parallels with my own system were unmistakable.

Parallel Principles

Though the two systems were developed independently — in different industries, for different platforms — they converged on nearly identical principles.

Both treat the testing process as a feedback loop, not a static script. The agent doesn’t attempt to write a perfect test in one pass. Instead, it experiments, observes, and adjusts. Every failure is data. Every success is an opportunity to store what works and reuse it.

Both systems rely on state awareness — the ability to see what’s happening in the environment, whether that’s a mobile app screen or a WordPress page. And both use that awareness to reason about next steps: deciding which button to press, which input to change, or what evidence confirms that the step succeeded.
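Stripped to its essentials, and with every name hypothetical, state-aware step selection can be pictured like this:

    # Illustrative only: capture what the environment currently shows, hand that
    # evidence to the model, and let it choose the next action. capture_state
    # and ask_llm are assumed stand-ins, not real APIs from either project.
    def next_action(goal: str, capture_state, ask_llm) -> str:
        """Show the model the current state and ask it to choose the next step."""
        state = capture_state()  # e.g. a screen description or the page HTML
        prompt = (
            f"Goal: {goal}\n"
            f"Current state of the application:\n{state}\n"
            "Name the single next action to take and the evidence that would "
            "confirm it succeeded."
        )
        return ask_llm(prompt)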

Finally, both depend on memory. In Tencent’s case, this took the form of reusable components — patterns of interaction the agent could recall and reapply to new tests. In my system, these memories are stored as verified Codeception steps in a vector database, searchable by semantic similarity. In both cases, the goal is the same: once something has been figured out, it should never have to be rediscovered.
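Reduced to its core, that memory layer is an embed, store, and retrieve cycle. The sketch below is illustrative only: embed stands in for whichever embedding model and vector store a given system actually uses, and the class is not the API of either project.

    # Sketch of the memory principle: embed each verified step, store it, and
    # retrieve the closest matches for a new task by cosine similarity.
    import math

    def cosine(a, b):
        """Cosine similarity between two equal-length vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    class StepMemory:
        def __init__(self, embed):
            self.embed = embed    # callable: text -> list[float]
            self.entries = []     # list of (vector, verified step code)

        def remember(self, description: str, step_code: str):
            self.entries.append((self.embed(description), step_code))

        def recall(self, description: str, k: int = 3) -> list[str]:
            query = self.embed(description)
            ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
            return [code for _, code in ranked[:k]]

A production system would delegate the similarity search to a real vector database, but the principle is exactly this: describe what you need, and pull back the verified steps that solved the closest problems before.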

When two very different projects — one for mobile apps, one for WordPress sites — independently evolve toward the same architecture, it’s perhaps a sign that the design is sound.

The Broader View

Reading Tencent’s paper felt like looking across a river and seeing someone building the same bridge from the opposite side. We hadn’t shared notes, but we’d followed the same logic.

Both projects emerged from the same frustration — the slow, manual nature of test creation — and both arrived at the same insight: that a language model, given the right feedback and context, can learn to interact with software environments intelligently.

The convergence is encouraging. It means this isn’t an isolated curiosity or an academic toy. It’s part of a broader shift — one where AI systems are beginning to understand software through interaction, not just description.

In Tencent’s work, I saw proof that the principles behind my own experiments were sound, and that they could scale beyond the niche of WordPress plugins to the vast world of enterprise applications.

As I read through the paper, it became clear that what we’re building — separately but in parallel — may well represent the early stages of a new field: one where AI doesn’t just write code, but learns to reason about how code behaves in the real world.

Tencent’s work also highlighted a challenge I was beginning to face myself: scaling. The process of generating, running, and refining tests works beautifully on a small scale, but the time cost rises quickly. Each loop — generate, execute, evaluate — takes real time, and when multiplied across thousands of steps or hundreds of tests, it becomes substantial. The Tencent team approached this problem with optimizations that mirrored my own emerging strategies: parallel execution, selective reruns, and the reuse of known-good fragments.
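As a rough illustration of the first two tactics, parallel execution and selective reruns, a scheduler might look like the sketch below. The names are hypothetical and neither team's infrastructure is this simple.

    # Assumed sketch of two scaling tactics: run independent tests in parallel
    # and rerun only the ones that failed last time. run_test is a hypothetical
    # callable that executes one test and returns True on success.
    from concurrent.futures import ThreadPoolExecutor

    def run_suite(tests, run_test, previous_failures=None, workers=4):
        """Run tests in parallel; if prior failures are known, rerun only those."""
        if previous_failures:
            tests = [t for t in tests if t in previous_failures]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = dict(zip(tests, pool.map(run_test, tests)))
        failures = [t for t, ok in results.items() if not ok]
        return results, failures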

In the next part of the book, we’ll turn to the practical challenges that come with this kind of system — how to keep it fast, stable, and trustworthy as it grows — and what it takes to turn this elegant idea into a dependable tool.