Why Multi-Agent Systems Beat Solo AI Coding

7 min read
Alireza Bashiri
Founder
Multi-agent harness diagram showing planner, generator, and evaluator working together

Video: The AI Automators breaks down Anthropic's research on multi-agent harnesses (17 min)

Anthropic just dropped something interesting: a research deep-dive into long-running AI agents that actually ship working software. The core finding isn't subtle — when you separate the agent doing the work from the agent judging it, quality goes up. Way up.

I've been following this thread for a while. Their earlier work on frontend design skills and long-running harnesses showed promise, but hit ceilings. This new research breaks through by borrowing from Generative Adversarial Networks: the generator-discriminator feedback loop that made image AI actually useful.

The problem with solo agents

Here's what happens when you ask a single AI agent to build something complex: it starts strong, loses the plot as the context window fills, and confidently delivers broken code. The agent will praise its own mediocre work because it's terrible at judging what it just produced.

Anthropic calls this "context anxiety" — models wrap up work prematurely as they approach what they think is their limit. They also observed that agents are weirdly lenient when grading their own output. Whether it's a subjective design task or verifiable code, the agent skews positive.

The fix? Split the roles.

The three-agent architecture

The final system uses three agents with distinct jobs:

Planner — Takes a one-sentence prompt and expands it into a full product spec. It's told to be ambitious about scope and focus on product context rather than implementation details. The planner also weaves AI features into the spec where it makes sense.

Generator — Builds the app feature by feature, working through sprints. Each sprint implements against the spec and self-evaluates before handing off to QA. It has git for version control and builds with React, Vite, FastAPI, and a real database.

Evaluator — Uses Playwright to click through the running app like an actual user, testing UI features, API endpoints, and database states. It grades each sprint against product depth, functionality, visual design, and code quality. If anything falls below threshold, the sprint fails and the generator gets specific feedback.

Before each sprint, generator and evaluator negotiate a "sprint contract" — agreeing on what "done" looks like before any code is written. This bridges the gap between high-level spec and testable implementation.
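The loop described above can be sketched in a few lines. This is a hypothetical outline, not Anthropic's actual harness: the contract shape, the threshold, and the helper names are all my assumptions, and `generate`/`evaluate` stand in for real agent calls.

```python
from dataclasses import dataclass

@dataclass
class SprintContract:
    feature: str
    done_criteria: list[str]  # agreed between generator and evaluator up front

@dataclass
class SprintResult:
    code: str  # whatever artifact the generator produced this sprint

PASS_THRESHOLD = 0.7  # illustrative; any criterion below this fails the sprint

def run_sprint(contract, generate, evaluate, max_attempts=3):
    """Generator builds against the contract; evaluator grades the result.
    A failed sprint loops back with the evaluator's specific feedback."""
    feedback = None
    for _ in range(max_attempts):
        result = generate(contract, feedback)          # generator agent
        scores, feedback = evaluate(contract, result)  # evaluator agent
        if min(scores.values()) >= PASS_THRESHOLD:
            return result, scores                      # sprint passes
    return result, scores  # best effort after retries
```

The point of the sketch is the separation: the generator never sees its own grade until the evaluator, working from the same contract, hands it back as feedback.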

The results are not close

Anthropic ran a head-to-head comparison: a solo agent versus the full harness, both given the same prompt to create a 2D retro game maker.

Approach        Duration     Cost    Result
Solo agent      20 minutes   $9      Broken on first use
Full harness    6 hours      $200    Fully functional

The solo run looked okay at first glance but fell apart quickly. The layout wasted space, the workflow was confusing, and the actual game didn't work — entities appeared on screen but nothing responded to input. Digging into the code revealed broken wiring with no surface indication of what went wrong.

The harness run started from the same one-sentence prompt but expanded it into a 16-feature spec across ten sprints. Beyond the core editors and play mode, it included sprite animation, behavior templates, sound effects, AI-assisted sprite generation, and shareable game export links.

The app actually worked. I could move my character, jump between platforms, and play the game. Was it perfect? No. The physics had rough edges and some edge cases needed attention. But the core features worked, which the solo run never managed.

What this means for AI skills

The SaaS Builder skill already uses some of these principles — structured artifacts, context handoff, iterative refinement. But Anthropic's research suggests there's room to go deeper, especially on the evaluation side.

Their frontend design work developed four grading criteria that could apply to any visual work:

  • Design quality — Does it feel coherent, or like a collection of parts?
  • Originality — Custom decisions, or template defaults and AI slop patterns?
  • Craft — Typography hierarchy, spacing consistency, color harmony
  • Functionality — Can users understand what to do and complete tasks?

They weighted design and originality more heavily than craft and functionality, since Claude already scored well on technical competence by default. The explicit penalty for generic AI patterns pushed the model toward more aesthetic risk-taking.
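As a concrete illustration, the four criteria could be combined into a single grade like this. The weights here are my assumption to show the design-and-originality emphasis; Anthropic's actual numbers aren't in the post.

```python
# Hypothetical weights reflecting the stated emphasis: design and
# originality count more than craft and functionality.
WEIGHTS = {
    "design": 0.3,
    "originality": 0.3,
    "craft": 0.2,
    "functionality": 0.2,
}

def weighted_grade(scores: dict[str, float]) -> float:
    """Combine per-criterion scores in [0, 1] into one overall grade."""
    return sum(WEIGHTS[criterion] * scores[criterion] for criterion in WEIGHTS)
```

Under this weighting, a technically clean but generic page scores lower than a riskier, more original one, even when the generic page wins on craft.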

In one example, the model generated a clean but predictable museum landing page for nine iterations. Then on cycle ten, it scrapped everything and rebuilt it as a spatial experience — a 3D room with checkered floor rendered in CSS perspective, artwork hung on walls, doorway-based navigation between galleries. That creative leap doesn't happen in single-pass generation.

The cost question

Six hours and $200 sounds expensive compared to twenty minutes and $9. But which one actually delivered value? The broken solo run is effectively worthless. The working app, even with rough edges, is something you could ship, iterate on, and improve.

Anthropic also found that as models improved, some harness complexity became unnecessary. Opus 4.6 eliminated the need for context resets and sprint decomposition. But the evaluator still adds value for tasks at the edge of what the model can do reliably solo.

The lesson isn't that you need maximum complexity forever. It's that you should stress-test your assumptions, strip away what's not load-bearing, and keep finding novel combinations as models evolve.

What founders should take away

If you're using AI to build products, don't expect one agent to do everything well. The Taste & Design skill helps with visual judgment, but having a separate evaluation step — whether that's another agent or your own critical eye — catches things the generator will miss.

The Guerrilla Marketing skill similarly separates strategy from execution, which matters when the strategy itself needs iteration.

For most founders, the practical takeaway is simpler: build in review steps. Don't let your AI agent grade its own homework. Have it produce something, then have something else — another agent, a different prompt, or you — critique it before moving forward.

The space of interesting agent combinations isn't shrinking as models improve. It's moving. The interesting work is finding the next combination.

Not sure which skills would help your specific build? The skill quiz takes 60 seconds and points you to the right ones.

