How to Run Bonsai Image 4B Locally: On-Device Text-to-Image on Mac & PC
Read time: ~7 minutes. What you’ll learn: how to install and run PrismML’s Bonsai Image 4B on a Mac (Apple Silicon) or an NVIDIA PC, which quantized variant to download, the one-shot command to generate an image, and how to run the local Studio (a web UI + API). Every command is taken verbatim from PrismML’s official demo repo — copy-paste safe.
For years, shipping image generation in a product meant one of two things: pay an API per image, or rent a GPU to host SDXL/Flux. Bonsai Image 4B changes the math. It's a text-to-image diffusion model compressed to about 1.2 GB (under 1 GB for the 1-bit variant) that runs entirely on your own device, generating a 512×512 image in about 6 seconds on an M4 Pro Mac. Weights and code are open under Apache 2.0.
This is the hands-on guide to running it locally. For the full benchmark story (sizes, compression ratios, and the honest caveats), see the Bonsai Image 4B release breakdown. Here the goal is simpler: getting images out of your own machine.
1. What you’re running
Bonsai Image 4B is a 4B-parameter diffusion transformer that PrismML quantized down to extreme low-bit precision. Two variants ship (numbers are PrismML’s own, self-reported):
| Variant | Size | Compression vs FP16 | Quality retention |
|---|---|---|---|
| Ternary (1.58-bit) | 1.21 GB | 6.4× | up to 95% |
| Binary (1-bit) | 0.93 GB | 8.3× | up to 95% |
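As a quick sanity check on the table, both compression ratios point to the same FP16 baseline, which is consistent with a roughly 4B-parameter model stored at 2 bytes per weight. This is back-of-the-envelope arithmetic on PrismML's reported figures, rounded:

```latex
\begin{aligned}
S_{\text{FP16}} &\approx 1.21\,\text{GB} \times 6.4 \approx 7.74\,\text{GB} \\
S_{\text{FP16}} &\approx 0.93\,\text{GB} \times 8.3 \approx 7.72\,\text{GB} \\
N_{\text{params}} &\approx \frac{7.7\,\text{GB}}{2\,\text{bytes/param}} \approx 3.9 \times 10^{9}
\end{aligned}
```

In other words, "compression vs FP16" appears to be measured against the same ~7.7 GB full-precision checkpoint for both variants.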
On real hardware, PrismML reports ~6 s per 512×512 image on an M4 Pro Mac and ~9.4 s on an iPhone 17 Pro Max. Both the weights and the inference code are licensed Apache 2.0, which permits commercial use and redistribution as long as you keep the license and attribution notices.
Which to pick: start with ternary (1.58-bit) — it’s the recommended default and the better-quality variant, and 1.21 GB is already tiny. Drop to the 1-bit binary only if you need the absolute smallest footprint (e.g. bundling into a mobile app).
2. Pick your backend
Bonsai runs on three setups, each with its own kernel backend:
- macOS (Apple Silicon): uses the MLX backend via mflux. This is the smoothest path and where the ~6 s number comes from.
- Linux / Windows (NVIDIA GPU): uses GemLite + HQQ kernels for the low-bit math.
- Windows: runs natively via triton-windows, with no WSL2 required.
Find the entry that matches your machine. You don't install the backend by hand: the setup script auto-detects your platform and installs the right one.
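The platform-to-backend mapping above is simple enough to express as a lookup. The sketch below is a hypothetical helper, not the detection logic inside setup.sh (which this guide doesn't reproduce); it turns `uname` output plus an NVIDIA check into the backend names used in this section. It assumes a POSIX shell, so on Windows it only applies under Git Bash/MSYS, not native PowerShell.

```shell
# Hypothetical helper: maps a platform to the backend names used in this
# guide. It illustrates section 2; it is NOT the logic inside setup.sh.
# Usage: pick_backend [OS] [ARCH] [yes|no]   (missing args are auto-detected)
pick_backend() {
  local os="${1:-$(uname -s)}"     # e.g. Darwin, Linux, MINGW64_NT-10.0
  local arch="${2:-$(uname -m)}"   # e.g. arm64, x86_64
  local gpu="${3:-}"               # "yes" if an NVIDIA GPU is present

  # Assumption: nvidia-smi being on PATH means "has an NVIDIA GPU".
  if [ -z "$gpu" ]; then
    if command -v nvidia-smi >/dev/null 2>&1; then gpu=yes; else gpu=no; fi
  fi

  case "$os" in
    Darwin)
      # Only Apple Silicon Macs are listed; Intel Macs are not.
      if [ "$arch" = "arm64" ]; then echo "mlx"; else echo "unlisted"; fi ;;
    Linux)
      if [ "$gpu" = "yes" ]; then echo "gemlite-hqq"; else echo "unlisted"; fi ;;
    MINGW*|MSYS*|CYGWIN*)
      # Git Bash / MSYS on Windows; native PowerShell users run setup.ps1 instead.
      if [ "$gpu" = "yes" ]; then echo "gemlite-hqq+triton-windows"; else echo "unlisted"; fi ;;
    *)
      echo "unlisted" ;;
  esac
}

# Example: pick_backend Linux x86_64 yes   # prints gemlite-hqq
```

Call it with no arguments to auto-detect, or pass all three arguments to check what another machine would get.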
3. Install
Clone the demo repo (PrismML-Eng/Bonsai-image-demo on GitHub) and run the setup script for your OS from the repo root.
macOS / Linux:
./setup.sh
Windows (PowerShell):
Set-ExecutionPolicy -Scope CurrentUser RemoteSigned
.\setup.ps1
setup.sh / setup.ps1 installs the Python dependencies and the correct kernel backend (MLX/mflux on Mac, GemLite+HQQ on NVIDIA, triton-windows on Windows) for your hardware.
4. Download the weights
The repo ships a downloader that pulls the quantized weights from Hugging Face. Grab the ternary variant (the recommended default):
./scripts/download_model.sh ternary # default — 1.58-bit, best quality
./scripts/download_model.sh binary # 1-bit variant — smallest
./scripts/download_model.sh --model binary-gemlite # explicit backend
The weights live in the prism-ml/bonsai-image Hugging Face collection, which also has MLX 2-bit/1-bit and GemLite 2-bit/1-bit packs. The downloader picks the right pack for the backend setup.sh installed.
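If you script the download (for example, in a build step that bundles the weights), the ternary-first advice from section 1 fits in a small helper. This is a hypothetical sketch: the thresholds are just the file sizes from the table in decimal MB, they cover the weight files only (not runtime memory), and the only names it emits are the two documented download_model.sh arguments.

```shell
# Hypothetical helper: pick a download_model.sh variant from a weight-file
# budget in decimal MB. Thresholds are the table's sizes (1.21 GB ternary,
# 0.93 GB binary); runtime memory is NOT accounted for.
pick_variant() {
  local budget_mb="$1"
  case "$budget_mb" in
    ''|*[!0-9]*) echo "budget must be a whole number of MB" >&2; return 2 ;;
  esac
  if [ "$budget_mb" -ge 1210 ]; then
    echo "ternary"   # recommended default, better quality
  elif [ "$budget_mb" -ge 930 ]; then
    echo "binary"    # smallest footprint
  else
    echo "no variant fits in ${budget_mb} MB" >&2
    return 1
  fi
}

# Example (a 1,000 MB budget selects the binary pack):
# variant="$(pick_variant 1000)" && ./scripts/download_model.sh "$variant"
```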
5. Generate your first image
The fastest way to confirm everything works — a one-shot generation straight from the CLI:
macOS / Linux:
./scripts/generate.sh --prompt "An icy Bonsai tree, in a rainy forest with a snowy mountains in the background, photo realistic."
Windows:
.\scripts\generate.ps1 -p "An icy Bonsai tree, in a rainy forest with a snowy mountains in the background, photo realistic."
That writes a 512×512 image to disk. To control size, seed, and output path:
./scripts/generate.sh -p "..." --size 1248x832 --seed 9909 --output outputs/icy_bonsai.png
Two things to know about dimensions: the default resolution is 512×512, and any custom dimensions must be multiples of 32 (e.g. 1248x832 works, 1250x830 won't). The --seed flag makes a generation reproducible: the same prompt and seed give the same image on the same setup, which is essential when you're iterating on a prompt.
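To catch a bad --size before a run, you can check the multiples-of-32 rule in the shell. This is a hypothetical helper, not part of the demo repo; it only validates the WIDTHxHEIGHT string and suggests the nearest size at or below the one you asked for.

```shell
# Hypothetical helper (not in the demo repo): validate a WIDTHxHEIGHT string
# against the multiples-of-32 rule before passing it to --size.
# Prints the size on success; on failure prints a hint to stderr.
check_size() {
  local w h n
  case "$1" in
    *x*) ;;
    *) echo "expected WIDTHxHEIGHT, got: $1" >&2; return 2 ;;
  esac
  w="${1%%x*}"   # text before the first "x"
  h="${1#*x}"    # text after the first "x"
  for n in "$w" "$h"; do
    case "$n" in
      ''|*[!0-9]*) echo "expected WIDTHxHEIGHT, got: $1" >&2; return 2 ;;
    esac
  done
  if [ $((w % 32)) -ne 0 ] || [ $((h % 32)) -ne 0 ]; then
    # Round each side down to the nearest multiple of 32 as a suggestion.
    echo "not multiples of 32: $1 (try $((w / 32 * 32))x$((h / 32 * 32)))" >&2
    return 1
  fi
  echo "$1"
}

# Example:
# size="$(check_size 1248x832)" && ./scripts/generate.sh -p "..." --size "$size" --seed 9909
```

Chaining with && means generate.sh never runs with an empty or invalid --size; for 1250x830 the helper suggests 1248x800.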
6. Run the local Studio (web UI + API)
For interactive use, the repo bundles a Studio: a FastAPI backend plus a Next.js frontend.
./scripts/serve.sh # FastAPI backend on :8000 + Next.js frontend on :3000
Open http://localhost:3000 for the web UI, or hit the API on :8000 directly. This is the setup you’d use to build image generation into your own app — the FastAPI endpoint gives you a clean local API to call, with zero per-image cost and nothing leaving the machine.
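When you script against the Studio, it helps to wait until the FastAPI backend is actually accepting connections before sending requests. The sketch below is a generic, hypothetical retry helper; the example probe only checks that something answers on :8000 (curl without -f succeeds on any HTTP response), so it makes no assumption about the Studio's route names. Once it's up, FastAPI apps normally serve interactive API docs at /docs, which is a quick way to discover the real endpoints, assuming the Studio hasn't disabled them.

```shell
# Hypothetical helper: retry a probe command once per second until it
# succeeds or TIMEOUT seconds have passed. Returns 0 on success, 1 on timeout.
wait_until() {
  local timeout="$1" waited=0
  shift
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Example (assumes serve.sh keeps running in the background and curl is installed):
# ./scripts/serve.sh &
# wait_until 120 curl -s -o /dev/null http://localhost:8000/ && echo "API is accepting connections"
```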
To drive the running server from the command line:
./scripts/send_request.sh -p "An icy bonsai tree..." --size 1248x832 --seed 9909
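Because --seed makes each result reproducible, a common loop is to try several seeds for one prompt and keep the best. This hypothetical sketch does that against the running Studio using only the send_request.sh flags shown above (-p, --size, --seed); where the images end up is whatever send_request.sh does by default.

```shell
# Hypothetical helper: send one prompt to the running Studio with a range of
# seeds, using only the send_request.sh flags shown above (-p, --size, --seed).
# Override SEND_REQUEST to point at a different script (or a stub for testing).
SEND_REQUEST="${SEND_REQUEST:-./scripts/send_request.sh}"

seed_sweep() {
  local prompt="$1" size="$2" seed="$3" last="$4"
  while [ "$seed" -le "$last" ]; do
    "$SEND_REQUEST" -p "$prompt" --size "$size" --seed "$seed" || return 1
    seed=$((seed + 1))
  done
}

# Example: five candidates for the same prompt, seeds 9905..9909
# seed_sweep "An icy bonsai tree..." 1248x832 9905 9909
```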
7. On-device / iPhone notes
PrismML’s headline claim is phone-class inference — ~9.4 s per image on an iPhone 17 Pro Max. A few practical caveats if you’re targeting mobile:
- iPhone inference uses the MLX 2-bit variant from the Hugging Face collection, and requires iOS 18+ on an Apple Silicon device.
- There’s no published Android port yet — this is an Apple-ecosystem story for now.
- The Mac MLX path (§2–§6 above) is the one to develop against first; it’s the same backend family and far easier to iterate on than a device build.
8. Local vs API — when each wins
Run Bonsai locally when you’re shipping image-gen inside a Mac/iOS/desktop app and want zero per-image cost, when images can’t leave the user’s device (privacy), or when you want offline generation. The ~6 s latency on an M4 Pro is slower than a cloud API round-trip, but fine for “user clicks generate and waits” UX.
Stay on an API (OpenAI / Gemini Imagen / Stability, ~$0.02–$0.04 per image) when you need top-tier quality on hard prompts, when you don’t control the client device, or when latency must be sub-second. Be honest about the ceiling: at 4B, Bonsai’s quality won’t match a 12B Flux or a flagship cloud model — the win is “good enough, on-device, free per image,” not SOTA.
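To make the trade-off concrete, here is a back-of-the-envelope comparison for a hypothetical batch of 10,000 images, using the API price range above and PrismML's reported ~6 s per image on an M4 Pro (one machine generating sequentially; your numbers will differ):

```latex
\begin{aligned}
\text{API cost} &\approx 10{,}000 \times \$0.02\text{--}\$0.04 = \$200\text{--}\$400 \\
\text{Local time} &\approx 10{,}000 \times 6\,\text{s} = 60{,}000\,\text{s} \approx 16.7\,\text{h}
\end{aligned}
```

The numbers suggest the local path's constraint is throughput per device rather than cost, which is why it suits one-user-at-a-time generation better than bulk jobs.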
One more caveat worth repeating from the release coverage: the “up to 95% quality retention” figure and the comparison grids are PrismML’s own, and curated samples always look stronger than a random prompt set. Generate across your actual prompt mix before betting a product on it.
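One practical way to generate across your actual prompt mix is to keep your real prompts in a text file, one per line, and render them all with a fixed seed each time you change variant or settings. This is a hypothetical sketch built only on the documented generate.sh flags (-p, --seed, --output); the file name and output folder are placeholders.

```shell
# Hypothetical helper: render every non-blank line of a prompts file with one
# fixed seed, writing numbered PNGs (001.png, 002.png, ...) so you can compare
# runs side by side. Uses only the generate.sh flags shown in section 5.
GENERATE="${GENERATE:-./scripts/generate.sh}"

render_prompt_file() {
  local file="$1" seed="${2:-9909}" outdir="${3:-outputs/eval}"
  local n=0 prompt
  mkdir -p "$outdir" || return 1
  # "|| [ -n ... ]" also handles a last line without a trailing newline.
  while IFS= read -r prompt || [ -n "$prompt" ]; do
    if [ -z "$prompt" ]; then continue; fi   # skip blank lines
    n=$((n + 1))
    # </dev/null keeps the generator from reading the prompts file via stdin.
    "$GENERATE" -p "$prompt" --seed "$seed" \
      --output "$outdir/$(printf '%03d' "$n").png" </dev/null || return 1
  done < "$file"
}

# Example: one prompt per line in my_prompts.txt (a file you create)
# render_prompt_file my_prompts.txt 9909 outputs/eval
```

Keeping the seed fixed means differences between two runs come from the variant or settings you changed, not from sampling noise.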
If you’re already building an on-device stack — a local LLM via llama.cpp or a local document-extraction model like NuExtract 3 — Bonsai Image slots into the same thesis: AI workloads on the user’s hardware, not the cloud.
The takeaway
Bonsai Image 4B makes on-device text-to-image a one-afternoon project: run setup.sh, download_model.sh ternary, then generate.sh with a prompt, and you have a 512×512 image in seconds with no API key and no per-image bill. For anything interactive, serve.sh gives you a local web UI and API to build against. Start with the ternary variant on a Mac, generate across your real prompts, and decide from there whether the on-device quality clears your bar.
For the full release context and benchmark caveats, see the Bonsai Image 4B breakdown.
Sources
- PrismML-Eng/Bonsai-image-demo on GitHub — setup scripts, downloader, and all commands in this guide
- prism-ml/bonsai-image Hugging Face collection — weights (MLX / GemLite, 1-bit / 2-bit / ternary)
- PrismML — Bonsai Image 4B release post — specs and (self-reported) benchmarks
- Commands verified against the official demo repo on 2026-06-02; size/speed/quality figures are PrismML’s own — validate on your hardware and prompts.