reddit.com via Reddit May 26th 2026

Cactus Hybrid Router cuts cloud AI queries to 55%

open source inference edge ai local-ai inference open-source

Key insights

A 65k-parameter router achieves Gemini-3.1-Flash-Lite performance parity by sending only 15-55% of queries to the cloud.
The Needle team shipped two specialized micro-models in one week: a 26M-param function-call specialist and a 65k-param router.
Cactus targets mobile and edge deployments where cloud API cost or latency is a binding constraint, not a preference.

Why this matters

Hybrid routing at 65k parameters proves that orchestration intelligence can live fully on-device without sacrificing quality against frontier cloud models, which directly rewrites the cost model for mobile AI applications. For founders building on-device AI products, Cactus shifts cloud inference from a default to a fallback, cutting per-query API spend by 45-85% depending on workload distribution. The two-model release cadence in a single week, driven by open-source community demand, establishes a new product archetype where small teams ship task-specific model suites rather than competing on scale.

Summary

The team behind Needle shipped Cactus Hybrid Router, a 65k-parameter model that decides per-query whether to run locally on Gemma4-2B or forward to cloud-hosted Gemini-3.1-Flash-Lite. The combined system matches Gemini's full performance while routing only 15-55% of queries to the cloud. The router adds negligible overhead at 65k parameters, making it viable on mobile and edge hardware where API cost or latency is a hard constraint. Routing happens per query, not per session, enabling dynamic cost-quality tradeoffs without user friction. Essentially: (Needle team, Google Gemini) a pairing that turns Gemma4-2B into a competitive cloud replacement at a fraction of the per-query cost. - 65k-parameter router is roughly 1/400th the size of the local model it manages. - Cloud utilization stays at 15-55% while matching full Gemini-3.1-Flash-Lite performance. - Shipped one week after Needle, directly in response to community requests, signaling a deliberate micro-model product suite. Task-specific micro-models handling orchestration work previously reserved for large models is the structural shift worth watching.

Potential risks and opportunities

Risks

Google repricing or deprecating Gemini-3.1-Flash-Lite would invalidate both the benchmark comparisons and the cost assumptions for apps built on Cactus today.
Enterprises adopting hybrid routing may underestimate data governance exposure, since 15-55% of queries still leave the device and traverse Google's cloud infrastructure.
If the router's performance parity claim is benchmark-specific rather than general, developers who deploy Cactus across diverse query types could see quality regressions without clear diagnostic signals.

Opportunities

Mobile AI app developers on Android and iOS can cut Gemini API spend by 45-85% by integrating Cactus Router alongside existing Gemma4-2B deployments, with no model retraining required.
Edge hardware vendors such as Qualcomm and MediaTek can use Cactus-style hybrid routing as a reference benchmark in their next on-device AI SDK releases to demonstrate inference efficiency.
The Needle team is building a task-specific micro-model suite with rapid community validation, making it a credible acquisition target for mobile AI infrastructure players including Google, Qualcomm, and Samsung.

What we don't know yet

Benchmark methodology for performance parity with Gemini-3.1-Flash-Lite is not fully disclosed, making it unclear which task domains drive the 15% vs 55% cloud routing rates.
Licensing terms for Cactus Router have not been specified, which is a blocking question for commercial mobile deployment.
Whether the router generalizes across query domains or requires per-domain fine-tuning to hold performance parity with Gemini-3.1-Flash-Lite.

Originally reported by reddit.com

Read the original article →

Original headline: r/LocalLLaMA: Cactus Hybrid Router — 65k-Param Model Lets Gemma4-2B Match Gemini-3.1-Flash-Lite by Routing Only 15–55% of Queries to Cloud