arxiv.org web signal

Bamman group finds LLMs lag on de novo humanities tasks

TL;DR

  • Researchers compared GPT-4o, Llama 3 70B and Mixtral 8x22B against BERT, RoBERTa and fine-tuned Llama 3 8B across ten cultural analytics datasets.
  • On folktales GPT-4o reached 0.838 accuracy versus RoBERTa's 0.616, but on literary time regression RoBERTa scored ρ=0.782 against GPT-4o's 0.485.
  • The paper concludes prompt-based LLMs are competitive with supervised models on established tasks but perform less well on newly constructed de novo phenomena.

A new paper from David Bamman's group pokes at a question a lot of computational humanities researchers have been quietly avoiding. When you ask GPT-4o or Llama 3 to classify a passage of fiction or a folktale by some cultural category, are you actually doing the same thing a labeled supervised classifier does, or are you doing something subtly different and worse?

The team evaluated prompt-based LLMs (GPT-4o, Llama 3 70B, Mixtral 8x22B) against traditional supervised models (logistic regression, BERT, RoBERTa, and a fine-tuned Llama 3 8B) across ten cultural analytics datasets. Tasks ranged from emotion, genre, and haiku to folktales, stream-of-consciousness, narrativity, and 'strangeness.' The headline finding, as the authors put it on arXiv, is that 'prompt-based LLMs are competitive with traditional supervised models for established tasks, but perform less well on de novo tasks.'

The split is visible in the numbers. On folktale classification, GPT-4o reached 0.838 accuracy against RoBERTa's 0.616. On stream-of-consciousness, the fine-tuned Llama 3 8B scored 0.945 versus RoBERTa's 0.900. But on literary time, a regression task, RoBERTa hit ρ=0.782 while GPT-4o managed only 0.485. The pattern the authors draw out is that tasks measuring concepts already widely known in the training distribution play to the LLMs' strengths, while researcher-defined phenomena that require fresh labeled data still favor supervision.

The honest caveat is that 'established' and 'de novo' are doing a lot of work in that framing, and ten datasets is not a settled taxonomy. The reporting also does not address how sensitive these numbers are to prompt phrasing, which matters for anyone trying to reproduce the results in their own corpus.

The forward-looking read the authors offer is that LLMs may have a real role in exploratory data analysis: outlining potential characteristics that can then be subjected to more formal testing. For humanities teams without annotation budgets, that is a route into the work. For anyone studying a genuinely new phenomenon, the labeled-data tradition is not going away.

Shared on Bluesky by 1 AI expert