How to Run Bonsai Image 4B Locally: On-Device Text-to-Image on Mac & PC

Read time: ~7 minutes. What you’ll learn: how to install and run PrismML’s Bonsai Image 4B on a Mac (Apple Silicon) or an NVIDIA PC, which quantized variant to download, the one-shot command to generate an image, and how to run the local Studio (a web UI + API). Every command is taken verbatim from PrismML’s official demo repo — copy-paste safe.

For years, shipping image generation in a product meant one of two things: pay an API per image, or rent a GPU to host SDXL/Flux. Bonsai Image 4B changes the math: it’s a text-to-image diffusion model compressed to roughly 1.2 GB or less (1.21 GB ternary, 0.93 GB binary) that runs entirely on your own device, generating a 512×512 image in about 6 seconds on an M4 Pro Mac by PrismML’s own measurement. It ships under Apache 2.0, with weights and code open.

This is the hands-on guide to running it locally. For the full benchmark story (sizes, compression ratios, and the honest caveats), see the Bonsai Image 4B release breakdown. Here, the goal is simply to get images coming out of your own machine.


1. What you’re running

Bonsai Image 4B is a 4B-parameter diffusion transformer that PrismML quantized down to extreme low-bit precision. Two variants ship (numbers are PrismML’s own, self-reported):

| Variant | Size | Compression vs FP16 | Quality retention |
| --- | --- | --- | --- |
| Ternary (1.58-bit) | 1.21 GB | 6.4× | up to 95% |
| Binary (1-bit) | 0.93 GB | 8.3× | up to 95% |
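
As a quick sanity check on the table, multiplying each quantized size by its reported compression ratio should give roughly the same FP16 baseline for both variants. The sketch below does that arithmetic; the helper name implied_fp16_gb is made up for illustration, and the inputs are PrismML's self-reported figures, not independent measurements.

```shell
#!/usr/bin/env bash
# Hypothetical helper: implied FP16 size (GB) = quantized size (GB) x compression ratio.
implied_fp16_gb() {
  awk -v size="$1" -v ratio="$2" 'BEGIN { printf "%.2f\n", size * ratio }'
}

implied_fp16_gb 1.21 6.4   # ternary -> 7.74
implied_fp16_gb 0.93 8.3   # binary  -> 7.72
```

Both rows land at about 7.7 GB, which is in the range you would expect from 2 bytes per weight on a model of roughly 4B parameters, so the two ratios are at least consistent with each other.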

On real hardware PrismML reports ~6 s per 512×512 image on an M4 Pro Mac and ~9.4 s on an iPhone 17 Pro Max. The license is Apache 2.0 for both weights and inference code, so commercial use is allowed; if you redistribute, you need to keep the license text and attribution notices.

Which to pick: start with ternary (1.58-bit) — it’s the recommended default and the better-quality variant, and 1.21 GB is already tiny. Drop to the 1-bit binary only if you need the absolute smallest footprint (e.g. bundling into a mobile app).


2. Pick your backend

Bonsai runs on three setups, each with its own kernel backend:

  • macOS (Apple Silicon): uses the MLX backend via mflux. This is the smoothest path and where the ~6 s number comes from.
  • Linux (NVIDIA GPU): uses GemLite + HQQ kernels for the low-bit math.
  • Windows (NVIDIA GPU): uses the same GemLite + HQQ kernels, running natively through triton-windows, with no WSL2 required.

Pick the one matching your machine; the setup script auto-detects and installs the right backend.
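
If you want to see which row of that list your machine falls into before running anything, the sketch below maps the output of uname to the backend named above. It is an illustration only: pick_backend and its labels are hypothetical, it is not taken from setup.sh, and it does not check whether an NVIDIA GPU is actually present.

```shell
#!/usr/bin/env bash
# Hypothetical helper (NOT from setup.sh): map OS name and CPU architecture
# to the backend this guide lists for that platform.
pick_backend() {
  local os="$1" arch="$2"
  case "$os" in
    Darwin)
      # Only Apple Silicon Macs are listed in the guide.
      if [ "$arch" = "arm64" ]; then echo "mlx-mflux"; else echo "not-listed"; fi ;;
    Linux)
      echo "gemlite-hqq" ;;                  # still requires an NVIDIA GPU
    MINGW*|MSYS*|CYGWIN*)
      echo "gemlite-hqq-triton-windows" ;;   # uname as reported by Git Bash, MSYS2, Cygwin
    *)
      echo "not-listed" ;;
  esac
}

pick_backend "$(uname -s)" "$(uname -m)"
```

On an Apple Silicon Mac, uname -s reports Darwin and uname -m reports arm64, so the last line should print mlx-mflux there.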


3. Install

Clone the demo repo, change into its directory, and run the setup script for your OS.

macOS / Linux:

./setup.sh

Windows (PowerShell):

Set-ExecutionPolicy -Scope CurrentUser RemoteSigned
.\setup.ps1

setup.sh / setup.ps1 installs the Python dependencies and the correct kernel backend (MLX/mflux on Mac, GemLite+HQQ on NVIDIA, triton-windows on Windows) for your hardware.


4. Download the weights

The repo ships a downloader that pulls the quantized weights from Hugging Face. Run one of the following; the ternary variant is the recommended default:

./scripts/download_model.sh ternary        # default — 1.58-bit, best quality
./scripts/download_model.sh binary         # 1-bit variant — smallest
./scripts/download_model.sh --model binary-gemlite  # explicit backend

The weights live in the prism-ml/bonsai-image Hugging Face collection, which also has MLX 2-bit/1-bit and GemLite 2-bit/1-bit packs. The downloader picks the right pack for the backend that setup.sh (or setup.ps1) installed.


5. Generate your first image

The fastest way to confirm everything works — a one-shot generation straight from the CLI:

macOS / Linux:

./scripts/generate.sh --prompt "An icy Bonsai tree, in a rainy forest with a snowy mountains in the background, photo realistic."

Windows (PowerShell):

.\scripts\generate.ps1 -p "An icy Bonsai tree, in a rainy forest with a snowy mountains in the background, photo realistic."

That writes a 512×512 image to disk. To control size, seed, and output path:

./scripts/generate.sh -p "..." --size 1248x832 --seed 9909 --output outputs/icy_bonsai.png

Two things to know about dimensions: the default resolution is 512×512, and any custom dimensions must be multiples of 32 (e.g. 1248x832 works, 1250x830 won’t, because neither 1250 nor 830 divides evenly by 32). The --seed flag makes a generation reproducible: the same prompt, size, and seed on the same setup should give the same image, which is essential when you’re iterating on a prompt.
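
The multiple-of-32 rule is easy to trip over when you start from an arbitrary target size. A minimal sketch, assuming you are happy to round each side down: the helpers snap32 and snap_size are hypothetical names, not part of the repo, and they only compute a size string that you then pass to --size yourself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: round a pixel dimension down to the nearest multiple of 32.
snap32() {
  echo $(( $1 / 32 * 32 ))
}

# Hypothetical helper: turn WIDTHxHEIGHT into a size where both sides are multiples of 32.
snap_size() {
  local w="${1%x*}" h="${1#*x}"   # split "1250x830" into 1250 and 830
  echo "$(snap32 "$w")x$(snap32 "$h")"
}

snap_size 1250x830   # prints 1248x800
snap_size 1248x832   # prints 1248x832 (already valid, unchanged)
```

Rounding down keeps the result no larger than what you asked for; rounding to the nearest multiple would be an equally reasonable choice if you care more about staying close to the target aspect ratio.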


6. Run the local Studio (web UI + API)

For interactive use, the repo bundles a Studio: a FastAPI backend plus a Next.js frontend.

./scripts/serve.sh   # FastAPI backend on :8000 + Next.js frontend on :3000

Open http://localhost:3000 for the web UI, or hit the API on :8000 directly. This is the setup you’d use to build image generation into your own app — the FastAPI endpoint gives you a clean local API to call, with zero per-image cost and nothing leaving the machine.
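
Before pointing a browser or a client at the Studio, it can help to confirm that both processes are actually listening. The sketch below checks the two ports named above using bash's built-in /dev/tcp redirection; port_open is a hypothetical helper, it needs bash rather than plain sh, and it only tests that a TCP connection is accepted, not that the Studio is healthy.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if something accepts TCP connections on host:port.
# Relies on bash's /dev/tcp pseudo-device, so run it with bash, not sh.
port_open() {
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

for port in 8000 3000; do
  if port_open 127.0.0.1 "$port"; then
    echo "port $port: up"
  else
    echo "port $port: nothing listening"
  fi
done
```

Once port 8000 is up, http://localhost:8000/docs is a reasonable place to look for the endpoint names and request schema, since FastAPI serves interactive docs there by default; this assumes the Studio has not disabled that default.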

To drive the running server from the command line:

./scripts/send_request.sh -p "An icy bonsai tree..." --size 1248x832 --seed 9909
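
When iterating on a prompt, it is often useful to hold the prompt and size fixed and vary only the seed. The sketch below prints one send_request.sh command per seed rather than running anything, using only the flags shown above (-p, --size, --seed); seed_sweep is a hypothetical helper, and the seeds are arbitrary.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) one send_request.sh command per seed.
# Usage: seed_sweep PROMPT SIZE SEED [SEED...]
seed_sweep() {
  local prompt="$1" size="$2"
  shift 2
  local seed
  for seed in "$@"; do
    printf './scripts/send_request.sh -p "%s" --size %s --seed %s\n' \
      "$prompt" "$size" "$seed"
  done
}

seed_sweep "An icy bonsai tree..." 1248x832 9909 9910 9911
```

Printing the commands first lets you review them before running them with the server up. Note that a prompt containing double quotes would need extra escaping, which this sketch does not handle.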

7. On-device / iPhone notes

PrismML’s headline claim is phone-class inference — ~9.4 s per image on an iPhone 17 Pro Max. A few practical caveats if you’re targeting mobile:

  • iPhone inference uses the MLX 2-bit variant from the Hugging Face collection, and requires iOS 18+ on an Apple Silicon device.
  • There’s no published Android port yet — this is an Apple-ecosystem story for now.
  • The Mac MLX path (§2–§6 above) is the one to develop against first; it’s the same backend family and far easier to iterate on than a device build.

8. Local vs API — when each wins

Run Bonsai locally when you’re shipping image-gen inside a Mac/iOS/desktop app and want zero per-image cost, when images can’t leave the user’s device (privacy), or when you want offline generation. The ~6 s latency on an M4 Pro can be slower than a fast cloud API round-trip, but it is fine for “user clicks generate and waits” UX.

Stay on an API (OpenAI / Gemini Imagen / Stability, ~$0.02–$0.04 per image) when you need top-tier quality on hard prompts, when you don’t control the client device, or when latency must be sub-second. Be honest about the ceiling: at 4B, Bonsai’s quality won’t match a 12B Flux or a flagship cloud model — the win is “good enough, on-device, free per image,” not SOTA.
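
To put rough numbers on that trade-off, the sketch below compares the API bill at the quoted $0.02 to $0.04 per image with the wall-clock time the same batch would take locally at ~6 s per image. The 10,000-image volume and the helper names are assumptions for illustration, and the local side ignores electricity and hardware cost.

```shell
#!/usr/bin/env bash
# Hypothetical helper: API cost in dollars for N images at a per-image price.
api_cost() {
  awk -v n="$1" -v price="$2" 'BEGIN { printf "%.2f\n", n * price }'
}

# Hypothetical helper: hours of sequential local generation for N images.
local_hours() {
  awk -v n="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", n * secs / 3600 }'
}

images=10000   # assumed monthly volume, purely illustrative
echo "API, low end:  \$$(api_cost "$images" 0.02)"    # $200.00
echo "API, high end: \$$(api_cost "$images" 0.04)"    # $400.00
echo "Local, one M4 Pro: $(local_hours "$images" 6) hours"   # 16.7 hours
```

In an on-device product that local time is spread across your users' machines rather than spent on one box, which is where the zero per-image cost argument is strongest.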

One more caveat worth repeating from the release coverage: the “up to 95% quality retention” figure and the comparison grids are PrismML’s own, and curated samples always look stronger than a random prompt set. Generate across your actual prompt mix before betting a product on it.

If you’re already building an on-device stack — a local LLM via llama.cpp or a local document-extraction model like NuExtract 3 — Bonsai Image slots into the same thesis: AI workloads on the user’s hardware, not the cloud.


The takeaway

Bonsai Image 4B makes on-device text-to-image a one-afternoon project: run setup.sh, download_model.sh ternary, then generate.sh with a prompt, and you have a 512×512 image in seconds with no API key and no per-image bill. For anything interactive, serve.sh gives you a local web UI and API to build against. Start with the ternary variant on a Mac, generate across your real prompts, and decide from there whether the on-device quality clears your bar.

For the full release context and benchmark caveats, see the Bonsai Image 4B breakdown.
