Cactus ships hybrid on-device + cloud-fallback inference: Gemma 4 2B runs locally, hard requests hand off to Claude/GPT/Gemini

Cactus Compute (YC S25, 5.2k★ on GitHub) ships a single SDK that runs Gemma 4 2B locally on iOS/Android/macOS/Linux and automatically hands off complex requests to a cloud model of your choice. v1.14 shipped April 18. The bet: builders shouldn't have to pick local vs cloud — the router should.


Cactus Compute (Y Combinator S25 batch, founded by Henry Ndubuaku) ships what it calls the “industry’s first hybrid inference stack” — a single SDK where Gemma 4 2B runs locally on your device, evaluates each request’s complexity, and automatically hands off to a cloud model when it can’t handle the request confidently.

The pitch: builders shouldn’t have to pick local vs cloud — the router should. v1.14 shipped April 18, 2026; the repo at cactus-compute/cactus sits at 5.2k stars / 414 forks.

What it actually is

From Cactus’s own Gemma 4 announcement:

“The on-device model evaluates the complexity of the request, and if it determines it can’t handle it confidently, it signals for handoff.”

The components:

  • On-device model: Gemma 4 (E2B = 2.3B effective params / E4B = 4.5B effective params), described as “the first on-device model that genuinely works across text, vision, and audio in a single architecture”
  • Hybrid router: bundled with the SDK, no separate service
  • Cloud fallback: routes hard requests to a cloud LLM of your choice (Claude / GPT-4 / Gemini / others)
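
The handoff described above can be sketched in a few lines. This is a minimal illustration, not Cactus’s implementation: the v1.14 docs do not disclose how confidence is computed, so the `confidence` field, the `route` function, the 0.7 threshold, and the two stand-in models are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class LocalResult:
    text: str
    confidence: float  # hypothetical score in [0, 1]; not a documented Cactus field


def route(
    prompt: str,
    local_model: Callable[[str], LocalResult],
    cloud_model: Callable[[str], str],
    threshold: float = 0.7,  # illustrative value, not a Cactus default
) -> Tuple[str, str]:
    """Return (answer, source), where source is 'local' or 'cloud'."""
    result = local_model(prompt)
    if result.confidence >= threshold:
        return result.text, "local"
    # Low confidence: hand the original prompt off to the cloud model.
    return cloud_model(prompt), "cloud"


# Stand-in models so the sketch runs without any SDK installed.
def fake_local(prompt: str) -> LocalResult:
    # Pretend prompts of up to 8 words are easy and longer ones are hard.
    confidence = 0.9 if len(prompt.split()) <= 8 else 0.3
    return LocalResult(text=f"[local] {prompt}", confidence=confidence)


def fake_cloud(prompt: str) -> str:
    return f"[cloud] {prompt}"


print(route("What time is it?", fake_local, fake_cloud))
```

In a real app the two callables would wrap the on-device model and whichever cloud client you choose; the point of the sketch is only that the routing decision sits in one place and the caller never picks a backend.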

Where it runs

Platform    Status
macOS       listed
iOS         listed
Android     listed
Linux       listed
Windows     not listed

SDK languages: React Native, Flutter, Swift, Kotlin, Python, Rust, C++. That is unusually broad mobile + native coverage for a local inference framework right now.

The performance number that matters

Cactus’s homepage demo shows the hybrid router handling voice transcription at ~467ms end-to-end latency when bouncing between on-device and cloud. The docs cite “30s audio end-to-end 0.3s” on M5 Mac / iPad / Vision Pro.

This is the actual win: when the on-device model handles a request (80%+ of common requests, per their pitch), you get near-zero latency and zero per-token cost. When it can’t, you fall back to a cloud model at normal API cost, but only for that fraction of requests.
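
To make the cost argument concrete, here is a back-of-the-envelope sketch. The 80% local share is Cactus’s own pitch figure; the request volume and the $0.002 per cloud call are made-up inputs for illustration, and `blended_cloud_cost` is a hypothetical helper, not part of any SDK.

```python
def blended_cloud_cost(
    requests: int, local_fraction: float, cloud_cost_per_request: float
) -> float:
    """Total cloud spend when only the non-local fraction reaches the API."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return requests * (1.0 - local_fraction) * cloud_cost_per_request


# Hypothetical workload: 100,000 requests at $0.002 per cloud call.
api_only = blended_cloud_cost(100_000, 0.0, 0.002)
hybrid = blended_cloud_cost(100_000, 0.8, 0.002)
print(f"API-only: ${api_only:.2f}  hybrid: ${hybrid:.2f}")
```

With these inputs the cloud bill drops from about $200 to about $40. The model is deliberately crude: it treats on-device inference as free and ignores that a handed-off request still spends time on the local evaluation before it reaches the cloud.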

Why this matters for builders

Three patterns this changes:

1. The “API-only” default is now suboptimal for mobile apps. If you’re shipping a chat / transcription / vision feature in an iOS or Android app, sending every request to a cloud API is wasteful — most user requests are simple enough for a 2B model. Cactus says: route them locally by default, only fall back when needed.

2. The “fully local” purist position is also wrong. Sometimes a request actually needs frontier-tier capability — for those, falling back to Claude / GPT-4 is the right answer. Cactus is taking the engineering position that one model can’t do everything well, and the router should decide.

3. This pairs with the cost-efficient cloud tier. If your fallback target is DeepSeek V4 Pro or Reasonix-style cache-engineered loops instead of Claude / GPT-4, the per-token cost of the cloud half drops 5-10×. Cactus + DeepSeek-tier fallback is probably the cheapest viable architecture for builder-grade mobile AI in 2026.

For the closed-model comparison frame (when Gemma 4 2B locally beats / loses to API-tier Gemini Flash / Claude Haiku), see our Gemini 3.5 Flash vs Claude Haiku 4.5 deep dive.

What’s still unclear

  • License: not explicitly shown in the README excerpt; likely permissive given the YC-backed builder positioning, but verify before commercial deployment.
  • Routing decision algorithm: Cactus describes “confidence-based handoff” but the v1.14 docs don’t disclose the exact mechanism (threshold / scoring / few-shot classification).
  • Windows support: notably absent from the listed platforms.
  • Per-request cost ceiling: no published cap, so a misbehaving router that punts everything to cloud could spike your bill.
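
Until a cap is documented, one way to contain the last risk is a client-side budget guard wrapped around the handoff. The sketch below is an assumption-laden illustration, not a Cactus feature: `CloudBudget`, `answer`, and the cost estimates are hypothetical, and it assumes your app can see the handoff signal and estimate the cost of a cloud call before making it.

```python
from typing import Callable, Tuple


class CloudBudget:
    """Tracks cloud spend and refuses handoffs once a cap is reached."""

    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def try_spend(self, estimated_cost_usd: float) -> bool:
        """Reserve budget for one cloud call; False if it would exceed the cap."""
        if self.spent_usd + estimated_cost_usd > self.cap_usd:
            return False
        self.spent_usd += estimated_cost_usd
        return True


def answer(
    prompt: str,
    wants_handoff: bool,  # hypothetical flag standing in for the router's signal
    budget: CloudBudget,
    local_fn: Callable[[str], str],
    cloud_fn: Callable[[str], str],
    est_cost_usd: float,
) -> Tuple[str, str]:
    """Use the cloud only when handoff is requested and budget remains."""
    if wants_handoff and budget.try_spend(est_cost_usd):
        return cloud_fn(prompt), "cloud"
    # Either no handoff was requested or the cap is exhausted: stay local.
    return local_fn(prompt), "local"
```

Once the cap is hit, this version degrades to the local answer rather than failing, which trades answer quality for a predictable bill; returning an error instead is an equally valid choice for some apps.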

For builders evaluating this against the just-released Harbor v0.4.19 (which targets local-agent-CLI ergonomics on desktop) and PrismML Bonsai Image 4B (the on-device diffusion model that landed yesterday): Cactus is the mobile / embedded play, Harbor is the desktop play, PrismML is the image play. They stack rather than compete.

Sources

  • Cactus Docs