reddit.com via Reddit

r/LocalLLaMA: Developer Benchmarks Frontier LLMs on Zero-Shot Sokoban Spatial Reasoning — Custom Maps Designed to Eliminate Shortcut Guessing, Most Models Fail Geometry-Logic Test

hallucinations llm-benchmarks spatial-reasoning capabilities

Summary

A developer ran a zero-shot Sokoban (box-pushing puzzle) benchmark across multiple frontier LLMs to test spatial and logical reasoning under strict 2D constraints, using custom maps designed to eliminate shortcut guessing. Most models failed systematically, even on puzzles a human can solve in seconds. The post reports which models showed partial success and which error patterns recurred. Commenters are debating whether spatial reasoning is a fundamental gap in current transformer architectures or a training-data problem that can be solved.
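To show what scoring a zero-shot Sokoban answer could involve, here is a minimal sketch of a move checker. It is not the post author's harness. The level notation (`#` wall, `@` player, `$` box, `.` goal, `*` box on goal, `+` player on goal), the `U`/`D`/`L`/`R` move string, and the names `parse_level` and `check_solution` are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

Pos = Tuple[int, int]

# Assumed move alphabet: one letter per step, as (row delta, column delta).
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def parse_level(rows: List[str]) -> Tuple[Set[Pos], Set[Pos], Set[Pos], Pos]:
    """Split an ASCII level into walls, boxes, goals, and the player position."""
    walls: Set[Pos] = set()
    boxes: Set[Pos] = set()
    goals: Set[Pos] = set()
    player = None
    for r, row in enumerate(rows):
        for c, ch in enumerate(row):
            if ch == "#":
                walls.add((r, c))
            if ch in "$*":
                boxes.add((r, c))
            if ch in ".*+":
                goals.add((r, c))
            if ch in "@+":
                player = (r, c)
    if player is None:
        raise ValueError("level has no player")
    return walls, boxes, goals, player


def check_solution(rows: List[str], moves: str) -> Tuple[bool, str]:
    """Replay a model's move string and report the first illegal step, if any."""
    walls, boxes, goals, player = parse_level(rows)
    for i, move in enumerate(moves):
        if move not in MOVES:
            return False, f"move {i}: unknown symbol {move!r}"
        dr, dc = MOVES[move]
        nxt = (player[0] + dr, player[1] + dc)
        if nxt in walls:
            return False, f"move {i}: player walks into a wall"
        if nxt in boxes:
            # A box moves only if the cell behind it is free.
            beyond = (nxt[0] + dr, nxt[1] + dc)
            if beyond in walls or beyond in boxes:
                return False, f"move {i}: box is blocked"
            boxes.remove(nxt)
            boxes.add(beyond)
        player = nxt
    if boxes == goals:
        return True, "solved"
    return False, "moves are legal but boxes are not all on goals"


# Hypothetical one-box level, used only to exercise the checker.
LEVEL = [
    "#####",
    "#@$.#",
    "#####",
]

if __name__ == "__main__":
    print(check_solution(LEVEL, "R"))
    print(check_solution(LEVEL, "RR"))
```

Replaying moves one at a time separates illegal answers (walking through walls, pushing a blocked box) from legal answers that fail to solve the puzzle. That distinction is useful when cataloguing recurring error patterns.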