Frontiers paper recasts GenAI as cultural 'context machines'

TL;DR

  • A 37-author paper led by Cody Kommers, published in Frontiers in Artificial Intelligence, proposes a framework called 'computational hermeneutics' for evaluating GenAI.
  • It argues GenAI systems are 'context machines' that must address three interpretive challenges: situatedness, plurality, and ambiguity.
  • Its three evaluation principles say benchmarks should be iterative, include people not just machines, and measure cultural context rather than only model output.

A 37-author paper that just landed in Frontiers in Artificial Intelligence is worth flagging not because it solves model evaluation, but because it argues most of the field is measuring the wrong thing. The piece, led by Cody Kommers and co-signed by humanities and ML researchers including Ruth Ahnert and Maria Antoniak, calls generative models 'context machines' and says current benchmarks treat culture as a variable to optimize when culture is actually the medium the systems work in.

The paper, also posted on arXiv under the title 'Computational Hermeneutics', argues that GenAI inherently faces three interpretive challenges. Situatedness: meaning only emerges in context. Plurality: multiple valid interpretations coexist without resolving into a single correct reading. Ambiguity: interpretations conflict through what Gadamer called the 'fusion of horizons' between an interpreter and an artifact. The authors borrow these from hermeneutic theory in the humanities, then argue they describe what next-token prediction is actually doing.

Why this should matter to anyone shipping models or buying them: the benchmark you trust shapes the model you get. The paper's prescription is concrete only at the meta level: benchmarks should be iterative, not one-off; should include people, not just machines; and should measure cultural context, not just model output. The authors do not deliver a drop-in replacement for MMLU or HumanEval. They are arguing about what good measurement would even look like.

The honest caveat is that this is a position paper, not a benchmark release, and 'measure cultural context' is the kind of recommendation that can quietly evaporate once it meets a quarterly model release schedule. What the reporting does not give you is an operational answer for how an eval team would actually run a hermeneutic benchmark cheaply enough to gate weekly checkpoints, or how the framework scales beyond text to image and audio systems where the interpretive stakes are arguably higher.

Still, the direction is the part worth watching. If even one major lab or evals shop takes the iterative, human-in-the-loop framing seriously, the gap between what benchmarks reward and what users actually experience could narrow. That gap is currently where a lot of the credibility problem lives.