arxiv.org web signal July 3rd 2026

Study: LLM scaling improves social sims, fails on cognitive bias

TL;DR

A new arXiv paper tests 85 Qwen3-architecture LLMs and 35 larger open-weight models up to 70B parameters on three social simulation domains.
Compute scaling reliably improves opinion and behavioral simulation for populations well-represented in English web corpora, but longitudinal forecasting and underrepresented opinions scale more slowly.
Scaling and fine-tuning from 0.5B to 8B parameters fail to improve model calibration with human cognitive biases like risk aversion.

Something interesting sits under a plain-looking research question. If you use LLMs to simulate populations for market research, policy work, or agent-based studies, does simply waiting for the next scaled-up model close the fidelity gap? A new arXiv paper by Caleb Ziems, William Held, Su Doga Karaca, David Grusky, Tatsunori Hashimoto, and Diyi Yang says: mostly yes, with exceptions you should probably care about.

The setup is unusually thorough for this kind of question. The authors fit scaling laws over 85 transformer LLMs built on the Qwen3 architecture, pre-trained on the DCLM web text corpus under fixed compute budgets from 10^18 to 10^20 FLOPs, then extend the analysis to 35 larger open-weight models up to 70B parameters. Three simulation domains get tested: opinion modeling, behavioral simulation, and longitudinal forecasting.
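
To make the idea of a scaling-law fit concrete, here is a minimal sketch, assuming simulation error falls as a simple power law in training compute. It is not the authors' procedure, which may differ. The function name fit_power_law is hypothetical, and the budgets and error values are synthetic numbers chosen for illustration, not results from the paper.

```python
import math


def fit_power_law(compute, error):
    """Fit error ~ a * compute**(-b) by least squares in log-log space.

    Hypothetical helper for illustration; returns the pair (a, b).
    """
    xs = [math.log10(c) for c in compute]
    ys = [math.log10(e) for e in error]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # A negative slope in log-log space means error shrinks with compute.
    return 10 ** intercept, -slope


# Synthetic compute budgets spanning the 10^18 to 10^20 FLOPs range.
budgets = [1e18, 3e18, 1e19, 3e19, 1e20]
# Made-up errors following an exact power law (a = 2.0, b = 0.05).
errors = [2.0 * c ** -0.05 for c in budgets]

a, b = fit_power_law(budgets, errors)

# Extrapolate the fitted curve one order of magnitude past the largest budget.
predicted_at_1e21 = a * 1e21 ** -b
```

In a sketch like this, a task that scales well shows a clearly positive exponent b, while a task that plateaus shows an exponent near zero, so extrapolating to larger budgets predicts little improvement.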

The headline finding is that compute scaling helps across all three domains, though not equally. The authors report that the majority of behavioral and opinion simulation tasks will improve rapidly with scale, particularly when the populations involved are well-represented in English web corpora. That is the good news for anyone hoping the next generation of frontier models makes LLM-driven simulation more usable off the shelf.

The awkward finding is the shape of the exceptions. Longitudinal forecasting and underrepresented opinions scale more slowly, especially when they are less correlated with general knowledge and reasoning benchmarks like MMLU. More striking, scaling fails to improve model calibration with human cognitive biases like risk aversion, and with heuristics like learning correlated rewards from related tasks. On those tasks, even fine-tuned models show no noticeable improvement from 0.5B to 8B parameters.

The honest caveat is that the study is bounded by the models it tested, mostly a Qwen3-architecture pretraining family plus open-weight releases up to 70B. It does not tell you whether a very different architecture or a bespoke training regime would break the plateau. Nor does the reporting give you a rule for deciding which downstream applications the caveat about underrepresented populations should rule out entirely and which it should merely hedge.

If the paper's read holds, the useful move for the field is to stop treating social simulation as a byproduct of general capability. The parts that matter most for policy work and market research (forecasting, bias calibration, underrepresented voices) are exactly the parts that do not automatically ride the scaling curve. That is the direction worth watching, and it argues for dedicated research effort rather than patience.

Originally reported by arxiv.org

Original headline: Will Scaling Improve Social Simulation with LLMs?