themaybe.org web signal

Heidy Khlaaf: AI Companies Are Co-Opting Safety Engineering

TL;DR

  • According to Khlaaf, current AI systems achieve 30-50% accuracy, while safety-critical systems are held to a 99.99% reliability standard.
  • An Air Force targeting system promised 90% accuracy but delivered only 25% in practice.
  • Khlaaf argues existing military and aviation safety standards should simply be enforced on AI systems.

Safety engineering has a specific and measurable meaning that predates the AI industry by decades: preventing death, environmental damage, and asset loss in systems where failure is catastrophic. On The Maybe podcast, Heidy Khlaaf, Chief AI Scientist at the AI Now Institute, argues that AI companies have systematically co-opted that terminology, redirecting public conversation away from concrete human harm and toward abstract concepts like alignment and existential risk.

The numbers make the problem concrete. Safety-critical systems, from energy grids to aircraft to nuclear plants, are held to reliability standards of 99.99%. Current AI systems, according to Khlaaf, achieve 30-50% accuracy. The phrase "AI safety" is being applied across both ends of that range as though the gap between them does not exist.
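To make the size of that gap tangible, here is a minimal arithmetic sketch. It assumes, as a simplification, that "reliability" and "accuracy" can both be read as a per-decision success rate, which is not strictly true of either metric. The function name expected_failures and the 10,000-decision batch are illustrative choices, not figures from the episode.

```python
def expected_failures(success_rate: float, trials: int = 10_000) -> float:
    """Expected number of failed decisions out of `trials` attempts.

    Assumption: each decision succeeds independently with probability
    `success_rate`. Real systems rarely behave this neatly.
    """
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    return (1.0 - success_rate) * trials


# Rates quoted in the article; the labels are only for display.
rates = {
    "safety-critical standard (99.99%)": 0.9999,
    "AI system, upper figure (50%)": 0.50,
    "AI system, lower figure (30%)": 0.30,
}

for label, rate in rates.items():
    failures = expected_failures(rate)
    print(f"{label}: about {failures:,.0f} failures per 10,000 decisions")
```

Under these assumptions the standard tolerates roughly 1 failure per 10,000 decisions, while the quoted AI figures imply roughly 5,000 to 7,000. That is a difference of three to four orders of magnitude, not a margin that incremental tuning would close.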

Her military examples are the most pointed. An Air Force targeting system that reportedly promised 90% accuracy delivered only 25% in practice. In the context of Gaza, Israeli verification algorithms for civilian casualty assessment relied on cellular network connectivity as a proxy metric, a measure disconnected from wartime reality, where displaced populations may lack functioning phones. These are not theoretical failure modes but cases where an accepted metric stood in for the thing it was supposed to measure.

Khlaaf identifies several structural features that reinforce the problem. Independent verification is being sidelined across military, nuclear, and healthcare sectors. New safety benchmarks appear daily but lack the specificity to prove anything about real-world deployment. And because large language models produce variable outputs, consistent safety validation is structurally difficult to achieve. She describes AI companies as effectively grading their own homework on safety claims.

What she recommends is not a new framework but enforcement of existing ones: apply military and aviation safety standards to AI systems where applicable, and use democratic processes to determine acceptable harm thresholds for novel applications like chatbots, a deliberation she argues is currently absent from deployment decisions. The episode does not say which regulator would hold jurisdiction to enforce those standards, or what the path from principle to policy looks like. But the underlying premise, that "how safe is safe enough" is a democratic question and not a technical one to be resolved by the companies doing the deploying, is the part worth sitting with.

Shared on Bluesky by 3 AI experts