AgenticSTS Testbed Doubles LLM Wins in Slay the Spire 2
TL;DR
- AgenticSTS uses Slay the Spire 2, where five prior LLM configurations reported zero wins at the lowest difficulty and the human win rate is 16 percent.
- Adding a strategic skills memory layer took wins from 3 of 10 to 6 of 10, with Fisher exact p about 0.37 (directional, not decisive).
- The release includes 298 completed trajectories with condition tags, frozen memory and skill snapshots, prompt records, and analysis scripts.
Most agent research right now is a race to stuff more tokens into context and hope the model sorts it out. The AgenticSTS paper on Hugging Face goes the other way. Every decision is made from a fresh user message assembled by typed retrieval, no raw cross-decision transcript is appended, and prompts stay bounded across runs of any length. The point of the design is that individual memory layers can be ablated in isolation, which is the thing that has been missing from most long-horizon agent write-ups.
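To make the bounded-prompt idea concrete, here is a minimal sketch of assembling a fresh message from typed memory stores, with one layer switched off for an ablation. Everything in it is an assumption for illustration: the `MemoryStore` class, the `build_prompt` function, the word-overlap retrieval, the character budget, and the example entries are hypothetical and are not taken from the AgenticSTS code.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Hypothetical typed memory layer that can be toggled for ablation."""
    name: str
    enabled: bool = True
    entries: list[str] = field(default_factory=list)

    def retrieve(self, query: str, k: int) -> list[str]:
        # Toy relevance rule (an assumption): keep entries sharing a word with the query.
        if not self.enabled:
            return []
        words = set(query.lower().split())
        hits = [e for e in self.entries if words & set(e.lower().split())]
        return hits[:k]


def build_prompt(state: str, stores: list[MemoryStore], k: int = 2, max_chars: int = 600) -> str:
    """Build a fresh message from the current state plus retrieved items only.

    No earlier transcript is appended, so the length never exceeds max_chars
    as long as the state line itself fits.
    """
    prompt = f"STATE: {state}"
    for store in stores:
        for item in store.retrieve(state, k):
            line = f"\n[{store.name}] {item}"
            if len(prompt) + len(line) > max_chars:
                return prompt  # budget reached: stop adding retrieved items
            prompt += line
    return prompt


skills = MemoryStore("skills", entries=["elite fights reward block cards",
                                        "shop early when gold is high"])
episodic = MemoryStore("episodic", entries=["lost to elite at floor 6 with low block"])
state = "floor 5 elite ahead low block"

full = build_prompt(state, [skills, episodic])
skills.enabled = False  # ablate one layer, leave the rest untouched
ablated = build_prompt(state, [skills, episodic])
print(full)
print(ablated)
```

Because each store is retrieved independently and nothing carries over between calls, turning off one layer changes only that layer's lines in the prompt, which is the property that makes per-layer ablation clean.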
The testbed is Slay the Spire 2, a closed-rule stochastic deck-building game in which a single run involves hundreds of tactical and strategic decisions. The authors note that prior benchmark data showed zero wins at the lowest difficulty across five LLM configurations, against a human win rate of 16 percent at the same difficulty. That is a useful place to measure from, because the task is nowhere near its ceiling and there is no saturated leaderboard to chase.
In the fixed ablation, a no-store baseline won 3 of 10 games and a version with the strategic skills layer enabled won 6 of 10. Doubling sounds dramatic, but the authors are careful to flag that the Fisher exact test gives p of about 0.37, which they describe as directional rather than statistically decisive at this sample size. Treat the numbers as reported, not settled.
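The reported p-value can be checked by hand from the 3 of 10 versus 6 of 10 counts. The sketch below computes a two-sided Fisher exact test using only the standard library. It assumes the conventional two-sided definition (sum the probabilities of all tables no more likely than the observed one); the summary does not state which variant the authors used, and the function name is our own.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum over every table that is no more probable than the observed one
    # (small relative tolerance guards against floating-point ties).
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Rows: baseline (3 wins, 7 losses), skills layer (6 wins, 4 losses).
p = fisher_exact_two_sided(3, 7, 6, 4)
print(round(p, 2))  # 0.37
```

With only 10 games per arm, a difference this large is still compatible with chance, which is why the result is best read as a direction to follow up on with more runs.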
The honest caveat is that this is a deck builder, not a browsing or coding agent, and a 10-game slice is thin. What the paper does not give you is which model sits behind the numbers, or what exactly the strategic skills layer stores between runs. Those matter if you want to port the pattern.
Where this becomes useful for the rest of the field is the release. The team is shipping 298 completed trajectories with condition tags, frozen memory and skill snapshots, prompt records, and analysis scripts, alongside a project page and code repository. For anyone trying to figure out which of their agent's memory layers is actually earning its tokens, that is a ready-made ablation bed to borrow rather than rebuild.
Originally reported by huggingface.co
Original headline: HF Paper 'AgenticSTS': Bounded-Memory Testbed Uses Slay the Spire 2 to Study Long-Horizon LLM Agents - Adding Strategic Skills Layer Doubles Wins From 3/10 to 6/10 Where Frontier LLMs Previously Reported Zero