Ops in a Frock: Why Most SRE is Still Theatre

AI kills toil. That’s what SRE was always meant to do.

Google published the handbook in 2016. The principles were clear: define reliability through SLOs and error budgets. Engineer toil away systematically. Share responsibility: the people who build it, run it. Treat operations as a software engineering problem, not a staffing problem.

Simple ideas. Genuinely powerful. And almost universally misunderstood.

What I keep walking into

I’ve lost count of the number of organisations I’ve joined that say they “do SRE” or have a “Platform Team.” Here’s what that usually means in practice:

Pipeline jockeys cranking YAML. A team of people whose entire job is writing and maintaining CI/CD configurations. They don’t own reliability. They don’t define SLOs. They maintain pipelines. They’re build engineers with a trendy title.

Firefighting teams on endless rota. An operations team that’s been renamed “SRE” but still spends 80% of their time responding to incidents they didn’t cause, for services they didn’t build, owned by teams that went home at 5pm. They’re on-call. The developers aren’t. Nothing has changed except the job title.

Developers still throwing features over the wall. The product teams ship code and move on. Someone else deals with deployment, monitoring, and what happens at 3am. The “SRE team” or “Platform Team” is just a buffer between developers and the consequences of their decisions.

That’s not SRE. That’s theatre. Ops in a frock.

The Platform Team rebrand

The latest iteration is particularly frustrating. “Platform Engineering” is everywhere right now — the CNCF 2024 annual survey found it’s rapidly becoming table stakes, though only about 9% of organisations are truly mature by their own maturity model’s standards. The pitch is compelling: build an internal developer platform that makes it easy for product teams to deploy, monitor, and operate their services. Golden paths. Self-service infrastructure. Developer experience as a product.

All good ideas. But here’s the test: do your product developers own production?

If the answer is no — if there’s still a handoff between the people who write the code and the people who get paged when it breaks — then you haven’t adopted Platform Engineering. You’ve rebranded your ops team again. The platform is just a nicer wall to throw things over.

The point of a platform is to enable ownership, not replace it. The platform team builds the tools, guardrails, and abstractions that make it possible for product teams to own their services end-to-end. The product teams use those tools to deploy, monitor, and operate their own services. Nobody throws anything over any wall because there is no wall.

When I see a “Platform Team” that also runs the on-call rota for product services, I know exactly what’s happened: the org wanted the label without the cultural change.

What Google actually meant

Go back and read the original SRE book. The core ideas are deceptively simple:

SLOs and error budgets. You define what “reliable enough” means for each service, and you measure it. When you’re within budget, you ship features fast. When you’re burning through your error budget, you slow down and fix reliability. This isn’t only a technical practice; it’s a business decision framework. It forces the conversation between product velocity and reliability to happen with data, not politics.
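If the arithmetic feels abstract, here is a minimal sketch of a request-based error budget. The 99.9% target, the request counts, and the function name are illustrative assumptions of mine, not figures from Google's book or from any particular monitoring tool.

```python
def error_budget_remaining(
    slo_target: float, total_requests: int, failed_requests: int
) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Hypothetical window: 10 million requests, 4,000 failures, 99.9% target.
remaining = error_budget_remaining(0.999, 10_000_000, 4_000)
print(f"{remaining:.0%} of the error budget left")
```

With these made-up numbers the SLO tolerates about 10,000 failures, so roughly 60% of the budget is still unspent. The output is one number that product and engineering can both argue from.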

Toil is the enemy. Toil is manual, repetitive, automatable work that scales linearly with service growth. Google’s target: SREs should spend no more than 50% of their time on toil. The rest should be engineering — building systems and automation that permanently eliminate categories of work.
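You can only hold a team to that 50% ceiling if you measure where the hours go. A rough sketch, assuming you have some kind of categorised time log; the category names, the hours, and the choice of which categories count as toil are all hypothetical.

```python
TOIL_CAP = 0.5  # the 50% ceiling described above


def toil_share(entries: list[tuple[str, float]], toil_categories: set[str]) -> float:
    """Fraction of logged hours that fall into categories you classify as toil."""
    total = sum(hours for _, hours in entries)
    if total <= 0:
        raise ValueError("no hours logged")
    toil = sum(hours for category, hours in entries if category in toil_categories)
    return toil / total


# Hypothetical week for one engineer: (category, hours).
week = [
    ("ticket handling", 14.0),
    ("manual deploys", 8.0),
    ("automation", 10.0),
    ("design review", 8.0),
]
share = toil_share(week, {"ticket handling", "manual deploys"})
over_cap = share > TOIL_CAP
print(f"toil share: {share:.0%}, over cap: {over_cap}")
```

The classification is the hard part, not the division. Deciding which categories are toil is a judgement call each team has to make for itself.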

Shared ownership. The people who build services share responsibility for running them. Not exclusively — SREs provide expertise, tooling, and consultation. But developers don’t get to write code and walk away. They’re on the hook for the operational characteristics of what they build.

Most organisations adopted the job title and ignored the principles. They hired “SREs” to do the same ops work that had always needed doing, and declared victory.

AI changes the economics

Here’s where it gets interesting. The toil that Google wanted to engineer away? AI is eating it alive.

Pipeline configurations, infrastructure templates, monitoring rules, runbook automation, incident triage, config generation, security scanning — AI can do all of this faster than a human. Not perfectly, but well enough. And it’s getting better every month.

This should be liberating. All that mechanical work that SRE teams spend their days on — the YAML cranking, the Terraform modules, the alert tuning, the deployment scripts — can increasingly be handled by AI tools. The toil that nobody could find time to automate is being automated anyway, by a different mechanism.

But it’s also exposing a problem: if toil was all your “SRE team” was doing, what’s left?

What’s left is the actual job

When you strip away the toil, what remains is the real engineering work that most organisations never got around to:

Defining what reliability means. Not in the abstract — for this service, for these users, with these business constraints. Setting SLOs that are meaningful, measurable, and connected to customer experience. Making error budgets real, not theoretical.

Designing systems that are reliable by construction. Not firefighting after the fact — designing for failure upfront. Graceful degradation. Circuit breakers. Capacity planning. The architectural work that prevents incidents rather than responding to them.

Owning production as engineers, not operators. Understanding the runtime behaviour of systems. Knowing why things fail, not just how to restart them. Treating production as a source of information, not a source of pages.

Making hard trade-offs visible. “We can ship this feature by Friday, but it’ll cost us half our error budget for the quarter. Here’s the data. What do we want to do?” That conversation requires context, judgement, and credibility that AI can’t provide.
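To put numbers on that kind of conversation, here is a small worked example. It assumes a 99.9% availability SLO measured over a 90-day quarter and a guessed 65 minutes of downtime risk from the launch; both figures are invented for illustration.

```python
from datetime import timedelta


def downtime_budget(slo_target: float, window: timedelta) -> timedelta:
    """Total downtime an availability SLO tolerates over the given window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return window * (1.0 - slo_target)


quarter = timedelta(days=90)
budget = downtime_budget(0.999, quarter)   # about 129.6 minutes
expected_cost = timedelta(minutes=65)      # hypothetical estimate for the launch
share_of_budget = expected_cost / budget   # timedelta / timedelta gives a float
print(f"budget: {budget}, launch would spend {share_of_budget:.0%} of it")
```

Under those assumptions the quarter's budget is a little over two hours, and the launch would spend about half of it. The code only does the division. Estimating the 65 minutes, and deciding whether it's worth spending, is the part that needs a person.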

The honest question

With AI killing the toil, are you engineering reliability? Or are you still playing theatre?

Here’s how to tell:

Do your product teams own their SLOs? Not the SRE team. Not the platform team. The people who write the code and make the product decisions. If reliability is someone else’s problem, you’re doing theatre.
Do you have error budgets that actually affect decisions? If nobody has ever slowed down a release because the error budget was exhausted, the SLOs are decoration.
Is your “SRE team” doing engineering or operations? If more than half their time is spent on reactive work — toil, firefighting, manual deployments — they’re an ops team with a modern title. AI is about to make that very obvious.
Could AI replace your SRE team? If the answer is “most of what they do, yes” — then what they’re doing isn’t SRE. It’s the toil that SRE was supposed to eliminate.
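The second question is the easiest one to make mechanical. Below is a sketch of a release gate driven by the remaining error budget, assuming you already compute that fraction somewhere. The 25% review threshold and the three outcomes are my own illustrative policy, not a standard.

```python
from enum import Enum


class Decision(Enum):
    SHIP = "ship"
    SHIP_WITH_REVIEW = "ship with review"
    FREEZE = "freeze"


def release_decision(budget_remaining: float, review_below: float = 0.25) -> Decision:
    """Map the unspent fraction of the error budget to a release policy."""
    if budget_remaining <= 0.0:
        # Budget exhausted: only reliability work should go out.
        return Decision.FREEZE
    if budget_remaining < review_below:
        # Budget is running low: ship, but only after an explicit risk review.
        return Decision.SHIP_WITH_REVIEW
    return Decision.SHIP


for remaining in (0.6, 0.1, -0.3):
    print(f"{remaining:+.0%} budget left -> {release_decision(remaining).value}")
```

If a gate like this has never returned anything but "ship" in your organisation, that tells you the budget isn't affecting decisions.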

The organisations that will thrive are the ones that use AI to finally deliver on the promise that SRE made a decade ago: eliminate the toil, share the ownership, and focus human engineers on the hard problems that actually require human judgement.

The ones that don’t will keep renaming their ops team every three years and wondering why nothing changes.