How to Run Apertus Locally: The Fully-Open Swiss LLM (8B & 70B) in 7 Minutes

Read time: ~7 minutes. What you’ll learn: what Apertus is and why “fully open” is the headline, which size to pick (8B vs 70B) and the RAM/VRAM you need, and four ways to run it locally — Ollama (easiest), llama.cpp (GGUF), Hugging Face transformers (the official quickstart), and MLX on Apple Silicon — plus the chat-template settings, realistic benchmark expectations, and the licensing angle that makes Apertus interesting for builders.

Sourcing note: the transformers code below is quoted from the official swiss-ai/Apertus-8B-Instruct-2509 model card. Model facts (15T tokens, 1,800+ languages, training transparency) come from the Apertus paper, arXiv 2509.14233. Where a number isn’t published by the project (e.g. exact VRAM), it’s flagged as a standard estimate, not an official spec. Ollama paths use a community GGUF, called out as such. Links at the bottom.

If you’ve seen “Apertus” trending and wondered whether it’s worth running locally, here’s the short version: it’s one of the few genuinely, fully open large language models — not just open weights, but open everything — and it runs on the same local stacks you already use for Qwen, Gemma, or Llama. This guide gets you from zero to a running model four different ways, then tells you honestly what to expect from it.


1. What Apertus actually is (and why “fully open” is the point)

Apertus is a family of language models from the Swiss AI Initiative — a collaboration led out of EPFL and ETH Zürich with 100+ contributors (per the paper). It ships in two sizes:

  • Apertus-8B: swiss-ai/Apertus-8B-Instruct-2509
  • Apertus-70B: swiss-ai/Apertus-70B-2509

Most “open” models give you weights and nothing else. Apertus is different, and that’s the whole pitch:

  • Trained on 15T tokens across 1,800+ languages, with roughly 40% of pretraining on non-English content. That multilingual breadth is unusual at this size.
  • Pretrained exclusively on openly available data, retroactively respecting robots.txt opt-outs, with filtering for non-permissive, toxic, and personally identifiable content.
  • A “Goldfish objective” during pretraining that suppresses verbatim memorization of training data — a deliberate copyright/privacy mitigation.
  • Every scientific artifact released: data-preparation scripts, intermediate checkpoints, evaluation suites, and training code — so the model is auditable and reproducible, not just downloadable.
  • The instruct releases are under Apache 2.0 — commercially usable.

So the reason to care isn’t “it tops the leaderboard” (it doesn’t — more on that in §7). It’s that for anyone who needs a model whose provenance they can actually defend — compliance-sensitive shops, public-sector work, research — Apertus is about as clean as it gets. Running it locally keeps that whole chain on your own hardware.


2. Which size, and what hardware you need

Model                 Best for                                        Notes
Apertus-8B-Instruct   Laptops, single consumer GPU, first try         The one to start with
Apertus-70B           Workstations / multi-GPU / heavy quantization   Much higher quality, much heavier

About memory: the Apertus model card does not publish an official VRAM figure, so treat the following as the standard rule-of-thumb for any dense model of this size, not an Apertus-specific spec:

  • 8B at BF16 (16-bit, unquantized): ~16 GB of VRAM/unified memory for weights alone (plus headroom for context).
  • 8B at 4-bit (Q4) quantization: ~5–6 GB — comfortable on a 16 GB laptop, viable on 8 GB with short context.
  • 70B at 4-bit: ~40 GB+ — workstation/multi-GPU territory.
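
The rule of thumb behind those figures is just parameter count times bits per weight. Here is a minimal sketch of that arithmetic; the function name is ours, and the 4.8 bits-per-weight value is an assumed average for Q4_K_M-style formats, not an Apertus-published number.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes.

    Excludes the KV cache, activations, and runtime overhead, so real
    usage is higher than this number.
    """
    return params_billions * bits_per_weight / 8


# (label, parameters in billions, assumed bits per weight)
scenarios = [
    ("8B at BF16", 8, 16),
    ("8B at ~4-bit", 8, 4.8),    # 4.8 bits/weight is an assumption
    ("70B at ~4-bit", 70, 4.8),  # same assumption
]

for label, params, bits in scenarios:
    print(f"{label}: ~{weight_memory_gb(params, bits):.1f} GB for weights")
```

The results land near 16 GB, 5 GB, and 42 GB. Context and runtime overhead come on top, which is why the practical figures above are a little higher than the raw weight sizes.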

If you’re on a typical 16 GB machine, 8B quantized to Q4 via Ollama or llama.cpp is the path of least resistance. Start there.


3. Method 1 — Ollama (easiest)

Ollama is the fastest way to a chat prompt. Note: at the time of writing, the swiss-ai org publishes the weights on Hugging Face, and the convenient Ollama GGUF is maintained by a community packager (MichelRosselli/apertus), not by the Apertus team. It works well; just know it’s community-hosted.

# Install Ollama from https://ollama.com first, then:
ollama run MichelRosselli/apertus:8b-instruct-2509-q4_k_m

That pulls the 4-bit (Q4_K_M) build of Apertus-8B-Instruct and drops you into an interactive chat. To call it from code, use Ollama’s native REST API on http://localhost:11434 (an OpenAI-compatible API is also served under /v1 on the same port):

curl http://localhost:11434/api/chat -d '{
  "model": "MichelRosselli/apertus:8b-instruct-2509-q4_k_m",
  "messages": [{"role": "user", "content": "Explain gravity in simple terms."}],
  "stream": false
}'
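
The same request can be made from Python with only the standard library. This is a rough sketch against the /api/chat endpoint used in the curl call above; the helper names build_payload and chat are ours, and it assumes Ollama is running locally with the model already pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"
MODEL = "MichelRosselli/apertus:8b-instruct-2509-q4_k_m"  # community GGUF tag


def build_payload(prompt: str, model: str = MODEL) -> dict:
    """Build a single-turn, non-streaming chat request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def chat(prompt: str, url: str = OLLAMA_URL, timeout: float = 120.0) -> str:
    """POST the prompt to a running Ollama server and return the reply text."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    # With "stream": false the server returns one JSON object whose
    # assistant text sits under message.content.
    return reply["message"]["content"]
```

Calling chat("Explain gravity in simple terms.") returns the model's answer as a string once the server is up.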

If you’d rather not trust a community build, use llama.cpp with a GGUF you convert yourself (next), or the official transformers path (§5).


4. Method 2 — llama.cpp (GGUF, most control)

llama.cpp is the engine under Ollama; using it directly gives you control over quantization, context length, and the server. There are community GGUF conversions of Apertus on Hugging Face (search the model name + “GGUF”); you can also convert the official weights yourself with llama.cpp’s convert_hf_to_gguf.py.

Once you have a .gguf file:

# Interactive CLI
./llama-cli -m apertus-8b-instruct-q4_k_m.gguf -p "Explain gravity in simple terms." -c 8192

# Or run an OpenAI-compatible server
./llama-server -m apertus-8b-instruct-q4_k_m.gguf -c 8192 --host 0.0.0.0 --port 8080

Then hit it like any OpenAI endpoint:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [{"role": "user", "content": "Explain gravity in simple terms."}],
  "temperature": 0.8,
  "top_p": 0.9
}'

The -c 8192 flag sets the context window; Apertus supports long context (see §7), but larger windows cost memory.
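
For scripting against llama-server, the curl call translates to a few lines of standard-library Python. This is a sketch that assumes the server from the command above is listening on port 8080; build_request, extract_reply, and ask are our own helper names, not part of llama.cpp.

```python
import json
import urllib.request


def build_request(prompt: str, temperature: float = 0.8, top_p: float = 0.9) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
    }


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return response["choices"][0]["message"]["content"]


def ask(prompt: str, base_url: str = "http://localhost:8080", timeout: float = 120.0) -> str:
    """Send one prompt to a running llama-server and return the reply text."""
    body = json.dumps(build_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```

Because the request and response follow the OpenAI chat-completions shape, the same helpers should also work against other servers that expose that route.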


5. Method 3 — Hugging Face transformers (the official quickstart)

This is the path the Apertus team documents themselves. It runs the official weights (no community repackaging) and is the right choice if you want full fidelity or plan to fine-tune. The following is quoted verbatim from the official model card:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "swiss-ai/Apertus-8B-Instruct-2509"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
).to(device)

prompt = "Give me a brief explanation of gravity in simple terms."
messages_think = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages_think,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt", add_special_tokens=False).to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=32768)

output_ids = generated_ids[0][len(model_inputs.input_ids[0]) :]
print(tokenizer.decode(output_ids, skip_special_tokens=True))

A few things worth knowing about this snippet:

  • Apertus is natively supported in transformers (the model class is ApertusForCausalLM; AutoModelForCausalLM resolves it automatically). Make sure you’re on a recent transformers version so the architecture is recognized.
  • Use the chat template. apply_chat_template(...) is not optional — it formats the conversation the way the instruct model was trained. Note add_special_tokens=False when tokenizing the already-templated text, so you don’t double up special tokens.
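  • Recommended sampling: the model card suggests temperature=0.8 and top_p=0.9. Pass those to generate() together with do_sample=True; unless the model’s generation config already enables sampling, transformers ignores temperature and top_p under greedy decoding.
  • The weights are published in BF16. On a 16 GB GPU you’ll likely want to load in 4-bit instead (via bitsandbytes, e.g. quantization_config=BitsAndBytesConfig(load_in_4bit=True)) or use the Ollama/llama.cpp routes above.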
  • max_new_tokens=32768 in the official example is generous — lower it for quick tests so you’re not waiting on a long generation.
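
Putting those notes together, a quick-test variant of the quickstart might look like the sketch below. It is not from the model card: sampling_kwargs and load_4bit are our own helper names, and the loader assumes bitsandbytes and accelerate are installed on a CUDA machine. We have not verified 4-bit loading against every transformers release, so treat it as a starting point.

```python
def sampling_kwargs(max_new_tokens: int = 256) -> dict:
    """The card's suggested sampling settings plus a short cap for quick tests."""
    return {
        "do_sample": True,  # temperature/top_p have no effect under greedy decoding
        "temperature": 0.8,
        "top_p": 0.9,
        "max_new_tokens": max_new_tokens,
    }


def load_4bit(model_name: str = "swiss-ai/Apertus-8B-Instruct-2509"):
    """Load the tokenizer and a 4-bit quantized model (assumes bitsandbytes + accelerate)."""
    # Imported lazily so sampling_kwargs() stays usable without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16, store weights in 4-bit
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant,
        device_map="auto",  # lets accelerate place the layers; no .to(device) needed
    )
    return tokenizer, model


# Usage, reusing model_inputs from the official snippet:
#   tokenizer, model = load_4bit()
#   generated_ids = model.generate(**model_inputs, **sampling_kwargs())
```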

6. Method 4 — MLX (Apple Silicon)

On an M-series Mac, mlx-lm is the most efficient route because it uses the unified-memory Metal backend instead of fighting CUDA assumptions.

pip install mlx-lm
mlx_lm.generate --model swiss-ai/Apertus-8B-Instruct-2509 \
  --prompt "Explain gravity in simple terms." \
  --max-tokens 512

If a native MLX build of the exact checkpoint isn’t available, the mlx-community organization on Hugging Face frequently posts quantized conversions. Again, check whether it’s an official or community upload before relying on it for anything sensitive. For an 8 GB / 16 GB Mac, a 4-bit MLX build of the 8B is the comfortable choice.

On older Intel Macs there’s no Metal acceleration for this, and full-precision 8B will be painfully slow — stick to a Q4 GGUF via llama.cpp/Ollama if you’re not on Apple Silicon or a discrete GPU. (We hit this exact wall doing local-model work on a 2019 Intel Mac.)


7. What context length and benchmarks to actually expect

Set expectations honestly, because this is where the “fully open” framing can oversell:

  • Context length: Apertus was pretrained at a 4,096-token context, then extended to support up to 65,536 tokens (64K). Real, but remember long context costs memory at inference — don’t set -c 65536 on a laptop and expect it to fit.
  • Quality: on MMLU (5-shot), the Apertus-8B SFT model scores around 60.9% (per independent benchmark evaluations; up to ~62.8% with curated preference tuning in follow-up work). That’s a competent 8B — in the same neighborhood as other good open 8B models — not a frontier model. The paper’s own framing is that Apertus “approaches state-of-the-art results among fully open models” on multilingual benchmarks. The operative phrase is among fully open models.
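
To see why a 64K window is a stretch on a laptop, it helps to estimate the key/value cache, which grows linearly with context length. The sketch below uses the standard formula for a transformer KV cache; the layer and head counts are illustrative placeholders for an 8B-class model with grouped-query attention, not Apertus’s published configuration, so check the model’s config.json for the real values.

```python
def kv_cache_gib(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # 2 bytes per value for a 16-bit cache
) -> float:
    """Approximate KV-cache size in GiB for a single sequence."""
    # Each token stores one key and one value vector per layer per KV head.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 2**30


# Hypothetical shape (assumption): 32 layers, 8 KV heads, head dimension 128.
for tokens in (8192, 65536):
    print(f"{tokens} tokens: ~{kv_cache_gib(32, 8, 128, tokens):.1f} GiB of KV cache")
```

Under these assumed dimensions the cache grows eightfold between an 8K and a 64K window, and that memory comes on top of the weights.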

In plain terms: pick Apertus when openness, multilingual coverage, or provenance is the requirement. If you just want the single smartest local model and don’t care about training transparency, a same-size Qwen or Gemma may score higher on English reasoning — see our Qwen 3.6 local coding guide and Gemma 4 12B locally for those.


8. Troubleshooting

  • “Unknown architecture / ApertusForCausalLM not found.” Your transformers is too old. Upgrade (pip install -U transformers) so the Apertus class is registered.
  • Out of memory on load. You’re loading BF16 on too little VRAM. Switch to a 4-bit path (Ollama/llama.cpp GGUF, or load_in_4bit=True with bitsandbytes).
  • Garbled or repetitive output. You almost certainly skipped the chat template, or applied special tokens twice. Use apply_chat_template(...) and tokenize with add_special_tokens=False as in the official snippet.
  • Community GGUF behaves oddly. Quantized community conversions occasionally have template/tokenizer mismatches. If output looks wrong, cross-check against the official transformers path before blaming the model.

The takeaway

Apertus is the rare LLM that’s open all the way down — 15T tokens, 1,800+ languages, Apache-2.0, and every training artifact published by the EPFL/ETH Swiss AI Initiative. Running it locally is straightforward: Ollama (ollama run MichelRosselli/apertus:8b-instruct-2509-q4_k_m, community GGUF) for the fastest start, llama.cpp for control, the official transformers quickstart for full fidelity, and MLX on Apple Silicon. Start with the 8B at Q4 on a 16 GB machine. Just calibrate expectations: it’s a solid, exceptionally transparent 8B — choose it for openness, compliance, and multilingual reach, not for topping English-reasoning leaderboards.

For other local setups, see NuExtract 3 locally, Qwen 3.6 for local coding, and Gemma 4 12B locally.

Sources