r/AI_Agents: Controlled 90-Day Benchmark Finds Expensive LLMs Underperform Cheaper Models as Trading Agents
Summary
A researcher on r/AI_Agents published results from a 90-day paper-trading benchmark comparing GPT-4o, Claude Opus 4, and Gemini Ultra against GPT-3.5 Turbo and Claude Haiku across 200 standardized equity and options scenarios. The more capable models achieved higher Sharpe ratios on complex multi-leg options strategies but significantly underperformed on straightforward momentum trades, where overthinking introduced latency and second-guessing. The researcher concludes that for rules-based or high-frequency strategies, model cost and capability tier are poor proxies for trading-agent quality.
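For readers unfamiliar with the metric, the Sharpe ratio cited above is commonly computed as mean excess return divided by the standard deviation of excess returns, then annualized. The summary does not say how the benchmark computed it, so the following is only a minimal sketch under assumed conventions: daily returns, 252 trading days per year, and a zero risk-free rate by default. The function name `sharpe_ratio` and its parameters are illustrative, not taken from the original post.

```python
import math
from statistics import mean, stdev


def sharpe_ratio(daily_returns, risk_free_daily=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from a sequence of per-period returns.

    Assumptions (not from the original post): returns are simple daily
    returns, the risk-free rate is expressed per day, and annualization
    uses sqrt(periods_per_year).
    """
    # Excess return over the risk-free rate for each period.
    excess = [r - risk_free_daily for r in daily_returns]
    if len(excess) < 2:
        raise ValueError("need at least two returns to estimate volatility")

    # Sample standard deviation (n - 1 denominator).
    volatility = stdev(excess)
    if volatility == 0:
        raise ValueError("returns have zero volatility; Sharpe ratio is undefined")

    return mean(excess) / volatility * math.sqrt(periods_per_year)


if __name__ == "__main__":
    # Hypothetical daily returns for one agent, for illustration only.
    sample = [0.01, -0.01, 0.02, 0.0]
    print(round(sharpe_ratio(sample), 2))
```

Comparing agents this way rewards steady returns over volatile ones, which is one plausible reason a model could rank differently on multi-leg options strategies than on simple momentum trades.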
Originally reported by Reddit r/AI_Agents