Fixing Claude with Claude: Anthropic reports on AI site reliability engineering

QCon London A member of Anthropic's AI reliability engineering team spoke at QCon London on why Claude excels at finding issues but still makes a poor substitute for a site reliability engineer (SRE): it constantly mistakes correlation for causation.

Alex Palcuie was formerly an SRE for Google Cloud Platform. "My job is keeping Claude up," Palcuie said, adding: "I've been using LLMs for actual incident response." Since January, he's been reaching for Claude before looking at other monitoring tools.

Alex Palcuie speaks at QCon London 2026

His team is busy. "Claude goes down more often than any of us would like. Earlier today, I was involved in an incident, even if I'm at a conference."

Is Palcuie automating himself out of a job? No, he said. "It would be hypocritical to say that Claude fixes everything. My team exists, we're hiring for many positions, this should show you that no, it doesn't work."

However, he said "many of us would not be surprised" if it did work in future, and his talk demonstrated that AI is already helpful.

Reflecting on his career in incident response, Palcuie said that having engineers on call is "a tax on humans because our systems are not good enough to look after themselves." He spoke of the stress of being on call: "Your phone buzzes, there's half a second where you go from asleep, to incident commander mode... then at 9:00 am you show up at work and have to look professional and presentable."

Incident response, he said, can loosely be broken down into a loop of four phases, known as the OODA loop: observe, orient, decide, act.

AI, he said, is fantastic for the observation part. "It reads the logs at the speed of I/O, it doesn't get bored, this at scale is something no human can match."

He recounted a real incident when, on New Year's Eve, Claude Opus 4.5 was returning HTTP 500 errors. "I open Claude Code and ask it to have a look." The AI wrote a SQL query and "within seconds it has the answer, an unhandled exception in the image processing class." It posted the Python stack trace, but "it doesn't stop there." Claude identified the failing requests, checked the accounts that sent them, and found 200 accounts "all sending 22 images at the same time." That looked suspicious. Claude dug further and found 4,000 accounts, all created at the same time and most sitting dormant. The AI said: "Stop looking at the 500s, this is fraud."

Without AI, "I would have marked this as a bug, I would not have paged account abuse," Palcuie said.

His next anecdote was less positive. LLM inference relies on a key-value (KV) cache for performance: the attention keys and values already computed for a sequence are stored so they do not have to be recomputed for every new token. "This KV cache can be gigabytes in size, it's really easy to break it, it's finicky, it's fragile." When it breaks, it forces a lot of extra compute, and monitoring shows many more requests.
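To see why a broken cache shows up in dashboards as extra load, consider a toy model; this is an assumption for illustration, not Anthropic's serving stack.

```python
# Toy model, not Anthropic's stack: each sequence caches the keys/values
# it has already computed, so a new token normally costs one unit of work.
cache: dict[str, list[str]] = {}  # sequence id -> cached K/V per token

def decode_step(seq_id: str, tokens: list[str]) -> int:
    """Return how many tokens' worth of K/V had to be (re)computed."""
    cached = cache.get(seq_id, [])
    missing = tokens[len(cached):]   # only the uncached suffix needs work
    cache[seq_id] = cached + missing
    return len(missing)

print(decode_step("req-1", ["a"]))            # 1: warm path
print(decode_step("req-1", ["a", "b"]))       # 1: only the new token
cache.clear()                                  # the "finicky, fragile" failure
print(decode_step("req-1", ["a", "b", "c"]))  # 3: full recomputation
```

Compute per request jumps while genuine user traffic stays flat, which is exactly the signature Palcuie described next.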

"Every single time, I would ask Claude, what happened here? Claude would say, request volume increase, this is a capacity problem, you need to add more servers."

The problem, he said, is that Claude "will get wrong correlation versus causation." It's like a new joiner on the team: they will think "oh, it's a capacity problem, when actually you lost your cache."

"This is why we can't trust LLMs for incident response," said Palcuie. The problem is its inability to "step back and start discerning between causation and correlation... For us humans, it is hard as well."

When Claude is asked to produce a postmortem report, it delivers "an 80 percent story that's pretty, it's readable and convincing," said Palcuie, but "it's really bad at root causes." Claude says "this was the thing, and we all know it is not one thing. It's not one root cause... It was never the rollout. It was never the code change. It was all the processes in the company that allowed the incident. And Claude doesn't know the history of your system, especially if your system has been there for ten years."

It is important, said Palcuie, to have SREs who "have been burnt before... they have the scar tissue." He worried about what happens if AI is used more: "will we have our skills atrophy?" – a concern that parallels what software developers often express about having AI write most of the code.

The Jevons Paradox, said Palcuie, is "the favorite paradox in the AI industry. It's when technological improvements increase the efficiency of resource use, but the resulting lower cost causes consumption to rise rather than fall."

In the case of software, "it's easier to write software, so we write much more of it, so the complexity goes up and not down, which means things break in more interesting ways, which means more incidents, more on call... all the improvements in the tooling will be cancelled by this ever-growing complexity."
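Toy numbers make the effect concrete; the figures below are purely illustrative assumptions, not data from the talk.

```python
# Illustrative Jevons arithmetic: a 2x efficiency gain, but demand for
# software triples because it is cheaper, so total effort still rises.
effort_per_feature = 1.0
features = 100

effort_after = effort_per_feature / 2   # tooling gets twice as efficient
features_after = features * 3           # cheaper software -> much more of it

print(effort_per_feature * features)    # 100.0 units of effort before
print(effort_after * features_after)    # 150.0 units after, despite the gain
```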

Maybe, said Palcuie, AI agents can simplify and manage the complexity, maybe "do what we've collectively learned in our industry, but that's a big if."

He ended on a positive note, saying: "The models are the worst today that they'll ever be."

The overall message, though, was not to leave SRE to AI, but to keep training reliability engineers, because they will be needed in future.