arxiv.org web signal

Adversarial Concept Search Predicts LLM Compositional Failures

TL;DR

  • A Compositional Interference metric derived from feature geometry predicts LLM failures without evaluating specific inputs.
  • On multihop question answering, correlation between the CI metric and model accuracy reached r = -0.855.
  • The method predicts cross-lingual transfer failures across 10+ languages using only English fact representations.

A persistent blind spot in language model evaluation is that we usually only discover what a model cannot do after it fails on actual inputs. Researchers from Brown University, Harvard University, USC, and Boston University propose a different approach: use the model's own internal geometry to predict compositional failure before it happens.

The core claim is about angular relationships between feature encodings. When two concepts are encoded near-orthogonally in a model's representational space, the model tends to compose them correctly. When their encodings are closely aligned, they interfere with each other, and compositional failures follow. The team formalizes this with a Compositional Interference (CI) metric, a normalized measure derived from local cumulative coherence, and shows it can anticipate failure modes without evaluating specific inputs.
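To make the geometric intuition concrete, here is a minimal sketch of an interference-style score. It is not the paper's CI definition, which is not spelled out here. It assumes a simple stand-in: for one feature vector, take the k largest absolute cosine similarities to the other features (a local version of cumulative coherence) and divide by k. The names `cosine` and `interference_score`, the choice of k, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def interference_score(features, i, k):
    """Illustrative score (an assumption, not the paper's formula).

    Averages the k largest |cosine| overlaps between feature i and
    every other feature. Values near 0 mean feature i is close to
    orthogonal to its neighbors; values near 1 mean heavy overlap.
    """
    overlaps = sorted(
        (abs(cosine(features[i], f)) for j, f in enumerate(features) if j != i),
        reverse=True,
    )
    return sum(overlaps[:k]) / k


# Toy feature directions: the last one nearly coincides with the first.
features = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.1, 0.0],
]

for idx in range(len(features)):
    print(idx, round(interference_score(features, idx, k=1), 3))
```

In this toy setup the first and last features score high because they point in almost the same direction, while the third scores zero because it is orthogonal to everything else. Under the article's framing, the high-scoring pair would be the one flagged as at risk.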

Results were reported across three settings: a synthetic task called SCAN, multihop question answering, and multilingual factual recall. On the multihop task, the correlation between CI and accuracy reached r = -0.855. On SCAN, CI ranking significantly outperformed random ordering (p < 0.01) across model sizes from 8 to 64 dimensions and training coverage ratios from 4% to 80%. For multilingual recall, the method predicts cross-lingual transfer failures across more than 10 languages using only English fact representations plus target-language subspaces, which keeps overhead relatively low for pipelines where comprehensive evaluation is expensive.
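For readers who want to see what a figure like r = -0.855 measures, the sketch below computes a Pearson correlation between per-item interference scores and accuracies. The numbers are made-up toy values, not data from the paper, and `pearson_r` is a hypothetical helper written out by hand for clarity.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values only (assumed for illustration): higher interference
# paired with lower accuracy.
ci_scores = [0.1, 0.3, 0.5, 0.7, 0.9]
accuracy = [0.9, 0.8, 0.6, 0.5, 0.2]

print(pearson_r(ci_scores, accuracy))
```

A strongly negative value means accuracy falls as interference rises, which is the direction of the relationship reported for the multihop task.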

The honest caveat is that the experimental scope is limited: synthetic benchmarks and structured QA tasks, not open-ended production settings. The paper does not give you a recipe for fixing the failures it identifies, only for finding them: the geometry tells you where the model will stumble, but not in a way that directly suggests a training remedy. Nor does the reporting offer evidence that the approach holds at the scale of very large production models.

The practical case is clearest for teams building multilingual systems or multi-hop reasoning pipelines who need to prioritize stress testing or annotation budgets. The authors describe the method as a foundation for active learning, and that framing fits: if geometric structure predicts which concept pairs are risky, you can direct evaluation resources where they matter rather than sampling broadly and hoping to find the hard cases.
