The Autonomous Car Paradox
Every day, humans get into cars and crash them. We kill each other in traffic. We miss stop signs, misjudge the curb, fall asleep at the wheel. We accept it. It’s the background hum of being alive — a tax we’ve quietly agreed to pay.
Then an autonomous car makes a single wrong decision, and it leads the evening news. Regulators convene. Philosophers opine. The story runs for weeks.
Look at the best deployed systems and the reaction doesn’t compute. In constrained operating areas, the strongest autonomous-vehicle data now shows fewer serious crashes than comparable human drivers. But we don’t compare machines to humans. We compare machines to a fantasy of perfection. When a person errs, it’s a tragedy. When a machine errs, it’s a scandal.
That same bias is now shaping how we think about AI security. It’s code. It’s a machine. Therefore it must not be allowed to make mistakes. So we try to wall the model in with guardrails, redactors, validators, policy layers — a multi-layered fence we can point to and say, we’re safe now.
We’re not. And chasing that kind of safety is taking us in the wrong direction.
The Fence Doesn’t Hold
The research is already clear on this. A recent episode of Lenny’s Podcast with AI security researcher Sander Schulhoff spelled it out without ambiguity: you cannot guardrail a large language model into reliable safety. You can’t drop a “sophisticated multi-layer redaction” on top of a public-facing LLM and guarantee nothing bad will happen. The fence can be climbed.
Anyone who wants to get around your guardrails is, almost by definition, creative. They have motivation. They have time. And they’re working on a system whose entire design is to be flexible with language.
Pretending you’ve solved prompt injection with clever instructions isn’t security engineering. It’s decoration.
Two Problems, One Bucket
Before we go anywhere useful, we have to separate two things the industry keeps conflating:
These are different problems. They require different tools. Most current mitigation blurs them, which is exactly why most current mitigation is both too weak for real adversaries and too heavy-handed for normal operation.
It’s also why institutions like NIST are now looking closely at agentic systems. The novel risk isn’t simply that an AI can be wrong. It’s that an agent can be wrong while holding credentials, reading private data, calling tools, and acting on the outside world.
The Self-Agreement Trap
The pattern I see constantly:
- Add a guardrail.
- Add a policy layer.
- Add a validation step.
- Add monitoring.
- Conclude: “we are now more secure.”
But many of those controls are correlated, not independent. When the same underlying model generates the action, validates the action, and explains the action, you haven’t built defence in depth. You’ve built self-agreement. It’s a feedback loop that looks safe in the dashboards and fails in exactly the scenarios that matter.
Different layers, same model, same blind spots. That’s not defence in depth. It’s a mirror holding up another mirror.
What We Already Do With Humans
This is the part of the argument I really want to land.
Every developer in a serious engineering org has production access to some extend. They can push things that will take the company down. They can — if they really wanted to — do malicious damage that would make a prompt injection look quaint. We hand them the keys every Monday morning.
We don’t try to make them infallible. We don’t gate their every keystroke behind a classifier. We don’t redact their thoughts.
We let them in.
What we do instead is design the environment around them. We scope their access. We require pull requests. We run tests. We keep backups. We make actions reversible. We contain the blast radius. We trust the person, and we engineer the surroundings so that when they slip — and they will — the damage is small, detectable, and recoverable.
An AI agent is much closer to that developer than to any piece of classical software. It has reach. It has judgement. It will be wrong. The right response is not to try to certify its judgement. It’s to design the environment — the same way we’ve done for high-trust humans for decades.
If you have trust issues with a person, you don’t fix them by demanding they never make a mistake again. You reduce the chance of harm, and you make sure the mistakes we all eventually make don’t hurt as much. It’s the same for AI agents.
Security Is The Perimeter
Once you accept the human analogy, the conclusion follows fast.
Security for AI agents is not about filtering what happens inside the conversation. It’s about the perimeter. It’s about what the agent fundamentally has access to, and what it can fundamentally do.
- Capability scoping. Don’t give the agent tools it doesn’t need.
- Tool isolation. Two tools shouldn’t share implicit trust.
- Scoped credentials. A read-only key for read-only tasks.
- Reversible actions. Drafts over sends. Soft deletes over hard ones.
- Human-in-the-loop on the irreversible steps.
- Auditable identity. Log which agent acted for which user, with which tool, against which resource.
The perimeter, layer by layer
The agent gets reach. The environment around it is what makes that reach survivable.
This is the same discipline we already apply to employees with production access. It’s not new. It’s just unfashionable compared to the vendor pitch for a “prompt-injection firewall.”
I made the same argument from the quality angle in Don’t Just Tell It. Enforce It. You can’t instruct your way to safety. You have to enforce it with systems that don’t negotiate.
Where The Power Lives
Here’s the tension that keeps this interesting.
The real value of AI — just like the real value of that developer — comes from the external interface. From reach. From the fact that the agent can read an email, query a database, fire a webhook, draft a document, hit an API, act on the world. Lock all of that down and you haven’t secured the agent. You’ve broken it.
You will not get safety by caging the model. You get it by being precise about what counts as a trusted instruction in the first place.
Make Trust Deterministic
Consider what happens when an AI agent reads your inbox.
An email arrives. It contains an instruction. The model has no native way to distinguish:
- A legitimate request from your colleague.
- A prompt injection from a stranger, smuggled in as text.
Content filtering can reduce risk, but it will never close this gap perfectly. Every filter becomes a puzzle for someone with motivation. The stronger path is to stop treating content as the trust boundary, and start authenticating the source.
Security people sometimes call the worst version of this the lethal trifecta: access to private data, exposure to untrusted content, and the ability to communicate externally. Give one agent all three and you’ve created a system that can read the sensitive thing, be manipulated by the hostile thing, and send the result somewhere else. Break one side of that triangle and the system becomes much easier to reason about.
- Signed messages with cryptographic identity attached.
- Tokens embedded in content produced by trusted internal systems.
- Explicit internal vs external labels at the transport layer.
- Deterministic schemas that the model can’t hallucinate past.
Then the rule changes shape.
External input → the agent cannot do Y. Internal, authenticated input → the agent can act. The entire trust decision collapses into something deterministic, cheap to verify, and hard to fake.
This is how the wider software stack has always worked. We don’t ask a browser to guess whether a request is legitimate. We give it cookies, CORS, CSP, HTTPS, signed tokens. It trusts what it can verify. AI agents need the same primitives — not bolted on as prompts, but built into the runtime.
The Platform Will Absorb The Content Layer
Most teams are currently building their own guardrails, validators, and monitoring. This won’t scale, and it shouldn’t have to. In the same way browsers grew sandboxing and cloud platforms grew IAM, AI runtimes will grow native safety primitives: identity-aware tools, capability tokens, signed inputs, standardised boundaries. And yes, better content filters too. They have to. The economics force it.
A concrete example landed the same day I finished writing this. On April 22, 2026, OpenAI released Privacy Filter — a 1.5-billion-parameter PII detection model you can run in a browser or on a laptop, shipped as open source under Apache 2.0. 128k-token context window. 96% F1 on the standard benchmark. The New Stack’s write-up captures the shift well. The fact that it runs locally isn’t a feature. It’s a stance: the data it’s protecting never has to leave the device to be protected.
To be clear: that is not perimeter security. It is content-layer risk reduction. Useful, necessary, and still the wrong place to put the final trust decision. If we are going to lower risk through filtering, redaction, classification, and overwatch, I would rather trust the big labs and platform vendors to bake those primitives into the runtime than watch every team improvise its own fragile prompt-injection firewall at home.
And OpenAI was careful about the framing, which I think they got right. In their own words, Privacy Filter is “one component in a broader privacy-by-design system.” No single filter solves this. The perimeter still has to decide what the agent can reach, what it can do, where it can send data, and which sources count as trusted. The filter lowers the risk inside that boundary. It does not replace the boundary.
Our job for the next few years isn’t to solve agent security once and for all. It’s to survive.
- Minimise blast radius.
- Avoid irreversible actions.
- Keep humans at the serious forks in the road.
- Assume your safeguards are weaker than they look.
The Perimeter, In Practice
Most of this essay is argument. Here’s a piece of evidence that isn’t.
Dan Guido at Trail of Bits — a 140-person security consultancy whose day job is finding vulnerabilities for a living — published their full AI-native operating system at the end of March 2026. A year earlier, 5% of the firm was on board with AI. Today, on the right engagements, they’ve gone from finding roughly 15 bugs a week to 200. One in five of all findings reported to clients now originates from an AI agent. Their sales team ships at about twice the consulting industry benchmark per rep.
The interesting part isn’t the numbers. It’s what they built to make the numbers possible — and how much of it is, effectively, a perimeter.
- A single standardised agent toolchain (Claude Code), so there’s a supported surface to govern.
- An AI Handbook that codifies which tools are approved on which data. Cursor isn’t allowed on most client code. Meeting recorders are banned for privileged client meetings. These are capability-scoping decisions, not prompt tricks.
- A curated marketplace of skills and plugins — a safe supply chain — because once you tell people to “go use skills,” they’ll install random things.
- Sandboxing as the default, shipped three ways: devcontainers, native macOS sandboxing, and their own Dropkit. Autonomy isn’t a vibe; it’s a box the agent lives in.
- Hardened defaults pushed through MDM — including mandatory package cooldown policies on npm. The easiest way to reduce risk is to make the safe path the default path.
- For sensitive clients: no web access at all. Blunt, but it’s the kind of blunt instrument real security uses.
And then the part I care about most — how they frame the thing content filters cannot fix. Straight from their “open questions” section:
The data the agent works on is inherently accessible to it.
That’s this entire essay in one sentence, written by a security firm. You can’t stop a tool from seeing the data you’ve given it. The only durable answer is to restrict what it can do with that data, where it can send it, and what it can reach next. Trail of Bits is now exploring agent-native shells like nono and agentsh that enforce policy at the kernel level. That is what authenticated, deterministic trust looks like, one layer further down the stack.
The same logic gets sharper in multi-agent systems. A compromised agent doesn’t have to destroy the system directly. It can poison the next agent’s context, trigger the wrong tool, or turn a handful of low-severity permissions into one high-severity path through the organisation.
When a security firm spells out, in public, that the answer to AI-agent risk is a sandbox, a supply chain, a capability policy, and kernel-level enforcement — not a cleverer system prompt — the argument isn’t philosophical anymore. It’s a running system, open sourced on GitHub, shipping client work every week.
AI Is Actually Better At This
One last thing that gets lost in the anxiety.
Humans are terrible at detecting prompt injection. We have a name for it in the human context: phishing. An email arrives, shaped to look trustworthy, carrying an instruction designed to manipulate. Classic prompt injection. We fall for it constantly.
Most people who get phished aren’t stupid. They just don’t have the technical wiring to notice the signals. The model does. A well-placed AI, given access to metadata and context, catches phishing patterns humans miss every single day.
The world is about to get safer on this front, not less safe. We just won’t frame it that way, because the same cognitive bias that makes us forgive human drivers will make us punish any AI that slips up once.
We accept humans making mistakes in traffic. We should accept machines making fewer ones — and still crediting the net win.
It’s All About Trust
Strip it all down and this is what’s left.
AI agent security isn’t a content problem. It’s a trust problem.
- Where you place trust. Which tools. Which sources. Which humans.
- How you verify it. Signatures. Tokens. Identity. Not vibes.
- What the agent can do when trust is present. A lot.
- What the agent can do when it isn’t. As little as possible.
That’s the whole game. Everything else is decoration.
The biggest risk isn’t that AI agents are uncontrollable. It’s that we convince ourselves they’re controlled when they aren’t — because we built an elaborate, comforting set of controls and declared victory. Activity is not the same as progress. Complexity is not the same as robustness. And a dashboard of green checks is not the same as a safe system.
Trust the perimeter. Authenticate the source. Design the environment. Survive long enough for the platforms to catch up.
If you’re interested in the deterministic side of this — how to stop relying on instruction and start relying on enforcement — I’d start with Don’t Just Tell It. Enforce It. If you want the practice-level version for engineers, Shifting Gears is the place to go next. And if you want the foundation underneath all of this — the infrastructure argument — Still True is where it lives.