Essay

Agents don’t care who’s the project manager

We smuggled frontend engineers, backend developers, project managers, Scrum Masters—the whole org-chart line-up—into the way we use LLMs, largely because it felt impressive to see them all “work together.” The ugly truth: it’s human comfort, not a need of the models.

By Niels Kristian Schjødt · April 2026 · ~9 min read
Illustration: cardboard cut-out figures arranged around a boardroom table on a theatre stage, while a single calm figure works alone at a drafting desk backstage

The Demo That Stuck With Me

I was at an event with friends a little while ago when someone presented what, at first glance, was an incredibly impressive AI system.

Ten agents.

Each one impersonating a role you’d recognise from any product team: a team lead, a senior engineer, a junior engineer, a backend specialist, a frontend specialist, reviewers, and various other specialists. An orchestrator sat in the middle, coordinating their work. They negotiated with each other. They passed tasks back and forth. They deliberated.

You could almost see the organisation chart.

And everyone in the room nodded approvingly. It looked exactly like how we build software — if we were building it with humans.

That was the first thing that gave me pause.

Something Felt Off

It wasn’t that it didn’t work. It did.

It wasn’t that it wasn’t clever. It clearly was.

It’s that I’d been getting the same kind of outcomes — often better ones — with a much simpler setup. One capable model. A handful of well-curated markdown files for context. A clear instruction. A review pass. Done.

So as I stood there watching, I kept asking myself the same question: what are those ten agents actually adding?

Not “what roles do they represent”. That part was obvious.

But: what capability? What information? What interface? What does a “senior engineer agent” know that the base model doesn’t? What does the “team lead agent” do that the “junior engineer agent” can’t?

The honest answer, as far as I could tell, was: nothing that improved the result — only creeping token use and runaway cost.

They were costumes. Elaborate, impressive costumes. And the model was doing exactly the same job underneath them all.

Listen to the Labs

There’s something I’ve noticed over the past year, listening to people who actually build frontier models. They keep saying the same thing — again and again and again and again.

Don’t go for complex. Just pick the most capable model.

It’s almost monotonous how consistent that message is. Anthropic’s engineering team puts it plainly in Building Effective AI Agents: “finding the simplest solution possible, and only increasing complexity when needed.” OpenAI says it the same way in A Practical Guide to Building Agents: “maximize a single agent’s capabilities first.” The frontier labs that make these models keep telling us the same thing — use the best model you can, give it the right context, and only add structure when something concretely breaks.

For the “right context” half, Anthropic’s Effective context engineering for AI agents is the clearest first-party case I’ve read for the idea that the work is in what tokens you let into each step—not in how many job titles you print on the org chart.

That advice runs directly against what a lot of the industry is doing.

Open any AI build showcase, conference demo, or breathless LinkedIn post, and you’ll see a different pattern: swarms of agents, elaborate orchestration, planner → researcher → critic → executor → reviewer pipelines. A whole movement of people building agentic machinery to solve problems that, in my experience, a single well-prompted call would handle just fine.

So who’s right — the people building the models, or the people building machinery around them?

Architecture, or Theatre?

After the demo, I kept turning this over in my head. Agents aren’t inherently wrong. I use them. I build with them. There’s real value there.

So where’s the line?

Here’s where I landed.

Agents make sense when they represent real boundaries in a system.

A customer support agent with its own context, tools, and interface to a specific product — that’s a real thing. Different audience, different source of truth, different interface.

A routing agent that decides which downstream system should handle a request — that’s a real thing. It’s making a decision about where to send work.

A data-extraction agent specialised on a specific schema, talking to a specific API — that’s a real thing. Different tooling, different data, different output contract.

These aren’t costumes. They’re architecture. They exist because the system they’re part of has actual boundaries that map to the agents.
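
To make the distinction concrete, here is roughly what a real-boundary router looks like. This is an illustrative sketch, not any framework's API; the handler names, tools, and the `classify` function are all hypothetical:

```python
# Illustrative sketch: a router is a real agent because each destination
# has genuinely different tools, data, and output contracts.
# All names here are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Handler:
    name: str
    tools: list[str]       # each destination owns its own tooling...
    system_prompt: str     # ...and its own source of truth and contract

HANDLERS = {
    "support": Handler("support", ["kb_search", "ticket_api"],
                       "Answer the customer using the knowledge base."),
    "extraction": Handler("extraction", ["schema_validate"],
                          "Emit JSON conforming to the invoice schema."),
}

def route(request: str, classify: Callable[[str], str]) -> Handler:
    # The router's whole job is one decision: where does this work go?
    # That's a system boundary, not a phase of a single thought process.
    return HANDLERS[classify(request)]
```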

Agents become theatre when they just represent different moments in a single thought process.

A planning agent, a reasoning agent, a critic agent, a refinement agent — those aren’t boundaries. They’re phases of thinking. And the model already does all of them, inside a single forward pass, when you prompt it well.

Externalising them into separate agents doesn’t give the model new capability. It just wraps the same underlying cognition in an organisation chart we recognise.

Illustration: theatre versus architecture. Left: ten agent nodes arranged in a circle around a central orchestrator, interconnected by a dense web of lines — the same model wearing ten costumes. Right: a clean three-step vertical pipeline of plan, execute, review.

Same model on both sides. Left: roles, hand-offs, an org chart shape. Right: one continuous context, three explicit phases.

Skills Are for Exceptions, Not Defaults

There’s a related mistake I keep seeing, tied up with the agent explosion: people stuffing each role-agent with generic skills.

A “team lead skill” full of stakeholder-management frameworks. A “junior engineer skill” explaining how to approach code review. A “backend specialist skill” that lists generic Postgres performance tips.

This is a misunderstanding of what skills are for.

Skills earn their keep when you want the model to do something in a specific way — a way that isn’t simply best practice. Your company’s pricing logic. Your test conventions. The particular shape of your compliance requirements. The peculiarities of your internal tooling. The way your team names things.

In those cases, it often doesn’t take more than a few well-written markdown files. The clever LLM reads them, internalises the constraint, and behaves accordingly. And that’s usually enough — maybe with a sweep of QA afterwards, done by the same engine, not broken out into yet another agent.
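
For illustration, a skill that earns its keep can be as small as this (an invented example; a skill file is just markdown, so that's what it is):

```markdown
# Pricing logic (internal)

- Prices are stored in minor units (øre), never kroner.
- Discounts never stack; the largest single discount wins.
- VAT is computed per line item, not per order, because the legacy
  invoicing system requires it.
```

Nothing in it is generic best practice; every line is an exception the model couldn't have guessed.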

But most of what people put in skills doesn’t fall in that category. It’s just … best practice. Stuff the base model already learned during training, from every book, codebase, and blog post written on the topic.

When you “give the AI a senior engineer skill” full of generic checklists about what senior engineers do, you’re not adding anything. You’re restating what it already knows, in worse detail than its training data, and spending tokens to do it.

Skills earn their keep when they encode exceptions. They waste tokens when they encode defaults.

Plan, Execute, Review

If you strip away the orchestration theatre, what’s left?

In my experience, this is usually enough:

  1. Plan — frame the problem clearly. Give the model the context it needs. State the constraints. State the success criteria.
  2. Execute — let a capable model produce the solution in a single shot.
  3. Review — one sweep of critique, verification, or refinement against explicit criteria. Same engine is fine.

That’s it.

No orchestrator. No negotiation. No hand-offs between specialist agents. No critic → refiner → critic-again spirals. A tight three-step loop that preserves context continuity the whole way through.
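
In code, the whole loop is almost embarrassingly small. A minimal sketch, assuming a generic `llm(prompt)` wrapper around whatever completion API you use (the wrapper is hypothetical; the shape is the point):

```python
# Minimal sketch of plan, execute, review with one model and one
# continuous context. `llm` is a hypothetical wrapper, not a real API.

def llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your model provider")

def solve(task: str, context: str, criteria: str) -> str:
    # 1. Plan: frame the problem, constraints, and success criteria.
    plan = llm(f"Context:\n{context}\n\nTask: {task}\n"
               f"Success criteria: {criteria}\n"
               "Write a short plan. Do not solve yet.")
    # 2. Execute: one capable model, one shot, full context intact.
    solution = llm(f"Context:\n{context}\n\nTask: {task}\n"
                   f"Plan:\n{plan}\nProduce the solution.")
    # 3. Review: one critique pass against explicit criteria. Same engine.
    return llm(f"Context:\n{context}\n\nTask: {task}\n"
               f"Solution:\n{solution}\n"
               f"Review against these criteria and return a corrected "
               f"final version:\n{criteria}")
```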

And in the overwhelming majority of cases where I’ve tried it, it covers more ground than people expect.

If that loop fails, it’s usually not because you need more agents. It’s because the planning was thin, the context was incomplete, or the task wasn’t well specified. More orchestration doesn’t fix any of those things. It just hides them behind structure.

This ties back to something I keep returning to in Still True and The Feedback Loop: when AI fails to solve a problem, it’s almost never because the model isn’t smart enough. It’s because we didn’t give it the context. Piling on agents doesn’t add context. It fragments the context you already had.

We’re Mimicking Ourselves

I think something larger is happening in how people are onboarding to AI right now.

We’re taking the old world — the one we know, the one we built out of necessity — and we’re mimicking its shape in AI’s behaviour. Roles. Hierarchies. Hand-offs. Reviewers. Gates. Team leads. Junior seats.

But those structures didn’t exist because they were optimal.

They existed because there were many of us. Separation of concerns, clear roles, step-by-step processes — these were all coping mechanisms for the fundamental constraint that humans can’t hold everything in their head at once, can’t specialise in everything, and can’t communicate telepathically. So we invented structure. Meetings. Org charts. Handover documents. Review gates.

The model doesn’t have those constraints.

It can hold the full context at once. It doesn’t need to specialise, because the specialisation already happened during training. It doesn’t need to coordinate, because it’s one thing.

Asking “how should we structure our multi-agent system” often turns out to be asking: how do we recreate the coordination problems of a ten-person team for a system that doesn’t have ten people in it?

It’s like saying AI cares about beautiful, readable code because humans do. It doesn’t. It cares about what’s efficient. It can read messy code just fine.

💡 Useful reframe: Mimicking human structure can be fruitful for the humans — to monitor, to track, to feel oriented through the same cognitive shortcuts our eyes and ears are used to. That’s fine. Just don’t do it for the AI’s sake. The AI doesn’t care.

One More Thing: Tokens

There’s a more concrete problem with all of this, too. It’s wildly inefficient.

Every agent in a pipeline typically:

  - carries its own copy of the shared context,
  - re-frames the problem in its own words, and
  - produces output the next agent has to re-read and reinterpret.

By the time the final answer comes out, you’ve burned tokens on framing the problem multiple times, shuffling context between steps, and letting different agents duplicate each other’s work.
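
A back-of-envelope illustration, with invented numbers; the multiplier, not the figures, is the point:

```python
# Every agent in the pipeline re-reads the shared context and re-frames
# the task. Figures below are made up for illustration.
context_tokens = 8_000    # shared project context each agent receives
framing_tokens = 500      # per-agent restatement of the problem
agents = 10

pipeline_input = agents * (context_tokens + framing_tokens)  # 85,000
single_call = context_tokens + framing_tokens                # 8,500

print(pipeline_input // single_call)  # prints 10: ten times the input
                                      # tokens, same model underneath
```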

And the frustrating part: a single well-prompted call, with the right context, would often one-shot the entire thing.

So it’s not just the cognitive cost of extra structure. It’s the literal cost of extra tokens — paid for the privilege of a process that feels more controlled but often isn’t.

Token inefficiency is the symptom. The architecture is the disease.

A Simpler Default

The way I build now — and recommend to others — is this:

  1. Pick the most capable model available.
  2. Give it the right context: a handful of well-curated markdown files is often enough.
  3. Run one tight loop: plan, execute, review.
  4. Add agents only where they map to a real boundary: different tools, different data, a different audience, a different interface.

Not because it looks impressive. Not because it feels familiar. But because it’s the smallest thing that works — and smaller things are easier to reason about, cheaper to run, and faster to iterate on.

If you want more texture on how this plays out in day-to-day engineering practice, I’ve written about it in Shifting Gears and NK’s AI Cookbook v2.

Closing

The ten-agent demo wasn’t useless. It was solving a real problem.

Just not the one everyone thought.

It wasn’t solving “how do we get the best result from the model”. It was solving “how do we make the system feel like something we already understand”. And for the humans in the room, that’s a legitimate thing to want. Familiar shapes are comforting. Org charts feel safe.

But those two goals aren’t the same. And we shouldn’t confuse them.

Use agents when they map to something real.

Avoid them when they’re just the shape of how we used to work.

That’s the difference between architecture and theatre.