How to Run Ornith-1.0 Locally: The Open-Weight Coding Model That Writes Its Own Scaffold
Read time: ~7 minutes. What you’ll learn: what Ornith-1.0 is and why “self-scaffolding” is the headline, which size to pick (the 9B is the laptop one) and the memory you need, and five ways to run it locally — Ollama (easiest), llama.cpp (GGUF), LM Studio (GUI), Hugging Face transformers (the official quickstart), and a vLLM/SGLang server you can wire into a coding agent — plus the reasoning-model sampling settings, honest benchmark numbers, and the integration angle that makes Ornith interesting for builders.
Sourcing note: the transformers code and recommended sampling settings below are quoted from the official deepreinforce-ai/Ornith-1.0-9B model card. The Ollama and llama.cpp commands are quoted from the official Ornith-1.0-9B-GGUF card. Benchmark numbers are from DeepReinforce’s release. The GGUF used here is published by the official `deepreinforce-ai` org, not a community repackage. Where a number isn’t published (e.g. exact VRAM), it’s flagged as a standard estimate. Links at the bottom.
If you’ve seen “Ornith-1.0” trending and wondered whether it’s worth running locally, the short version: it’s a brand-new, MIT-licensed open-weight coding model family that’s genuinely good at agentic coding — the 9B posts 69.4% on SWE-Bench Verified — and the official team ships GGUF builds, so it runs on the same local stacks you already use for Qwen or Gemma. This guide gets you from zero to a running model five different ways, then tells you honestly what to expect.
1. What Ornith-1.0 actually is (and what “self-scaffolding” means)
Ornith-1.0 is a family of coding models from DeepReinforce, post-trained on top of Gemma 4 and Qwen 3.5 (both Apache 2.0). It ships in four sizes:
- Ornith-1.0-9B: `deepreinforce-ai/Ornith-1.0-9B` (dense, the laptop one)
- Ornith-1.0-31B: `deepreinforce-ai/Ornith-1.0-31B` (dense)
- Ornith-1.0-35B: `deepreinforce-ai/Ornith-1.0-35B` (MoE)
- Ornith-1.0-397B: `deepreinforce-ai/Ornith-1.0-397B` (MoE flagship; an FP8 build is also published)
The interesting part is the training, not just the weights. Most RL-tuned coding models learn to produce better solutions inside a scaffold that a human designed (the prompt structure, the tool-call loop, the retry logic). Ornith uses RL to jointly learn both the solution rollout and the scaffold that drives it — DeepReinforce calls this self-scaffolding. By optimizing the scaffold and the solution together, the model discovers better search trajectories instead of being boxed into one a human picked. That’s the whole pitch: an agent that’s been trained on how to drive itself, not just what to output.
For builders the practical consequence is that Ornith is tuned for multi-turn, tool-using agentic work — the kind of loop a coding agent runs — rather than single-shot completion. It’s also a reasoning model: it emits an internal chain-of-thought wrapped in <think>...</think> before the final answer, which matters for how you parse its output (more in §6).
And it’s MIT licensed — globally accessible, commercially usable, no regional restrictions. Running it locally keeps your code on your own hardware.
2. Which size, and what hardware you need
| Model | Type | Best for | Notes |
|---|---|---|---|
| Ornith-1.0-9B | Dense | Laptops, single consumer GPU, first try | The one to start with |
| Ornith-1.0-31B | Dense | One bigger GPU / heavy quant | Mid step up |
| Ornith-1.0-35B | MoE | Workstation, more quality per active param | MoE = faster than its size suggests |
| Ornith-1.0-397B | MoE | Server / multi-GPU, flagship quality | FP8 build available |
The 9B is the realistic local target for most people. Its official GGUF sizes (from the model card) tell you exactly what you’ll download and roughly what you need to hold in memory:
| Quantization | File size |
|---|---|
| Q4_K_M | 5.63 GB |
| Q5_K_M | 6.47 GB |
| Q6_K | 7.36 GB |
| Q8_0 | 9.53 GB |
| BF16 | 17.9 GB |
About memory: the model card doesn’t publish an official VRAM figure, so treat this as the standard rule-of-thumb, not an Ornith-specific spec. For the 9B at Q4_K_M (~5.6 GB of weights) you want roughly 8 GB of VRAM/unified memory to leave headroom for context — comfortable on a 16 GB laptop, viable on 8 GB with a modest context window. Full BF16 (~19 GB) is discrete-GPU territory. If you’re on a typical machine, 9B at Q4_K_M via Ollama or llama.cpp is the path of least resistance. Start there.
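The rule of thumb above can be sketched as a small calculation. The file sizes come from the GGUF table; the flat 2 GB headroom for context and runtime overhead is an assumption chosen to match the "roughly 8 GB for Q4_K_M" estimate, not a figure from the model card, and both function names are illustrative.

```python
# GGUF file sizes for Ornith-1.0-9B, copied from the table above (GB).
GGUF_SIZES_GB = {
    "Q4_K_M": 5.63,
    "Q5_K_M": 6.47,
    "Q6_K": 7.36,
    "Q8_0": 9.53,
    "BF16": 17.9,
}

# Assumption: flat allowance for KV cache and runtime overhead at a modest
# context window. A rule of thumb, not a published Ornith requirement.
DEFAULT_HEADROOM_GB = 2.0


def estimated_memory_gb(quant, headroom_gb=DEFAULT_HEADROOM_GB):
    """Weights plus headroom, in GB, for one quantization."""
    return GGUF_SIZES_GB[quant] + headroom_gb


def largest_quant_that_fits(available_gb, headroom_gb=DEFAULT_HEADROOM_GB):
    """Return the biggest quant whose estimate fits, or None if none do."""
    fitting = [
        quant
        for quant in GGUF_SIZES_GB
        if estimated_memory_gb(quant, headroom_gb) <= available_gb
    ]
    if not fitting:
        return None
    return max(fitting, key=lambda quant: GGUF_SIZES_GB[quant])
```

Pass the memory that is actually free for the model rather than the machine's total: on a laptop with unified memory, the OS and other apps take their share first. Raise `headroom_gb` if you plan to run a large context window.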
3. Method 1 — Ollama (easiest)
Ollama is the fastest way to a chat prompt, and because DeepReinforce publishes the GGUF on Hugging Face, you can pull it directly. This command is quoted from the official GGUF model card:
# Install Ollama from https://ollama.com first, then:
ollama run hf.co/deepreinforce-ai/Ornith-1.0-9B-GGUF:Q4_K_M
That pulls the 4-bit build straight from the official repo and drops you into an interactive chat. Swap the tag (:Q5_K_M, :Q6_K, :Q8_0) if you have the memory and want more quality.
To call it from code, use Ollama’s native chat API at http://localhost:11434/api/chat (an OpenAI-compatible API is also served under /v1 on the same port):
curl http://localhost:11434/api/chat -d '{
  "model": "hf.co/deepreinforce-ai/Ornith-1.0-9B-GGUF:Q4_K_M",
  "messages": [{"role": "user", "content": "Write a Python function that merges two sorted lists."}],
  "options": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
  "stream": false
}'
Note the sampling options — those are the model’s recommended settings (see §6). Because Ornith is a reasoning model, expect a <think> block in the response before the actual code.
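The same request can be made from Python with only the standard library. This is a minimal sketch that mirrors the curl call above; `build_payload` and `chat` are illustrative names, and it assumes the non-streaming response carries the reply under `message.content`, which is the shape Ollama's native chat API returns.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"
MODEL = "hf.co/deepreinforce-ai/Ornith-1.0-9B-GGUF:Q4_K_M"


def build_payload(prompt):
    """Build the same JSON body as the curl example."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
        "stream": False,
    }


def chat(prompt, url=OLLAMA_URL, timeout=600):
    """Send one user message and return the assistant's raw text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["message"]["content"]


# Usage (requires a running Ollama with the model pulled):
# print(chat("Write a Python function that merges two sorted lists."))
```

The generous timeout is deliberate: a reasoning model can spend a while in its think block before the first line of code appears.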
4. Method 2 — llama.cpp (GGUF, most control)
llama.cpp is the engine under Ollama; using it directly gives you control over context length, quantization, and the server. With the -hf flag it pulls the official GGUF for you — no manual download. These commands are quoted from the official card:
# Interactive CLI — pulls the official GGUF directly
./llama-cli -hf deepreinforce-ai/Ornith-1.0-9B-GGUF:Q4_K_M
# Or run an OpenAI-compatible server
./llama-server -hf deepreinforce-ai/Ornith-1.0-9B-GGUF:Q4_K_M
Once llama-server is up, hit it like any OpenAI endpoint:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [{"role": "user", "content": "Refactor this loop into a list comprehension: result = []\nfor x in items:\n    if x > 0:\n        result.append(x * 2)"}],
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20
}'
Add -c 16384 (or higher) to llama-server to widen the context window — Ornith supports a very long context (§7), but larger windows cost memory, so size it to your hardware rather than maxing it out.
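Because a reasoning model can think for a while before answering, streaming the response is often more pleasant than waiting for the whole body. The sketch below parses the server-sent event lines you get back when `"stream": true` is added to the request, assuming llama-server follows the usual OpenAI streaming format (`data: {...}` chunks ending with `data: [DONE]`). `iter_stream_text` is an illustrative helper, not part of llama.cpp.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Usage sketch: iterating an open urllib response yields byte lines, so
# with urllib.request.urlopen(request) as response:
#     for text in iter_stream_text(response):
#         print(text, end="", flush=True)
```

The parser only looks at `delta.content`, so whether the think block arrives inline or in a separate field depends on how the server is configured.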
5. Method 3 — LM Studio (GUI, no terminal)
If you’d rather not touch the command line, LM Studio is the simplest GUI route. Install it from lmstudio.ai, search the model catalog for Ornith-1.0, and download a quant that fits your RAM (start with the 9B Q4_K_M). LM Studio gives you a chat window plus a local OpenAI-compatible server you can toggle on for code.
One real-world data point: in independent testing of the 35B build, Simon Willison reported ~103 tokens/second in LM Studio with multi-turn agentic tool use working — a useful sign that even the MoE builds are practical on capable consumer hardware. (That’s a third-party measurement on his machine, not an official spec; your speed depends on your GPU/quant.)
6. Method 4 — Hugging Face transformers (the official quickstart)
This is the path DeepReinforce documents themselves, running the official weights with no repackaging. It’s the right choice if you want full fidelity or plan to fine-tune. The following is quoted verbatim from the official model card:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepreinforce-ai/Ornith-1.0-9B")
model = AutoModelForCausalLM.from_pretrained(
    "deepreinforce-ai/Ornith-1.0-9B",
    dtype="auto",
    device_map="auto",
)
A few things worth knowing:
- Use a recent transformers (the model card calls for ≥5.8.1) so the architecture is recognized.
- It’s a reasoning model. Ornith emits a `<think>...</think>` block before its final answer. When you build the prompt with the tokenizer’s chat template and generate, you’ll need to strip or separate the think block from the final code in your own post-processing; don’t feed the raw `<think>` content back as if it were the answer.
- Recommended sampling: the card specifies temperature 0.6, top_p 0.95, top_k 20 for normal use (or temperature 1.0 to reproduce the reported benchmark numbers). Pass these into `generate()` together with `do_sample=True`.
- Dtype is BF16 (`dtype="auto"` resolves it). On a 16 GB GPU you’ll likely want a 4-bit load instead (a bitsandbytes `BitsAndBytesConfig(load_in_4bit=True)` passed as `quantization_config`) or the Ollama/llama.cpp routes above.
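The notes above can be sketched as one generation helper plus a think-block splitter. This is a minimal sketch, not the model card's code: `generate_answer` and `split_reasoning` are illustrative names, and it assumes the `<think>` tags survive decoding with `skip_special_tokens=True`. Splitting on the closing tag also covers chat templates that pre-fill the opening tag in the prompt, where only `</think>` shows up in the generated text.

```python
def split_reasoning(text):
    """Return (reasoning, answer), splitting on the first closing think tag."""
    head, tag, tail = text.partition("</think>")
    if not tag:
        # No think block found: treat everything as the answer.
        return "", text.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


def generate_answer(model, tokenizer, prompt, max_new_tokens=2048):
    """Apply the chat template, sample with the recommended settings, split."""
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,  # sampling settings are ignored without this
        temperature=0.6,
        top_p=0.95,
        top_k=20,
    )
    # Keep only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    text = tokenizer.decode(new_tokens, skip_special_tokens=True)
    return split_reasoning(text)
```

With the `model` and `tokenizer` from the quickstart loaded, `reasoning, answer = generate_answer(model, tokenizer, "Write a function that merges two sorted lists.")` gives you the two parts separately. If the answer comes back cut off, the think block probably consumed the token budget, so raise `max_new_tokens`.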
7. Method 5 — Serve it to a coding agent (vLLM / SGLang)
This is where Ornith earns its keep: it was built for agents, so the highest-value local setup is an OpenAI-compatible server your coding agent talks to. The model card documents two production servers:
- vLLM (≥0.19.1): supports Ornith’s tool-call and reasoning parsers.
- SGLang (≥0.5.9): OpenAI-compatible endpoint with the `qwen3_coder` tool parser (Ornith inherits Qwen 3.5’s tool-call format).
A typical vLLM launch looks like:
pip install "vllm>=0.19.1"
vllm serve deepreinforce-ai/Ornith-1.0-9B \
  --enable-auto-tool-choice \
  --tool-call-parser <as documented on the model card> \
  --reasoning-parser <as documented on the model card> \
  --port 8000
Check the model card for the exact `--tool-call-parser` / `--reasoning-parser` flag values for your vLLM version before launching: parser names change between releases, and getting them right is what makes tool calls and `<think>` separation work end-to-end.
Once it’s serving on http://localhost:8000/v1, point any OpenAI-compatible coding agent at it. DeepReinforce calls out integration with OpenHands, Hermes Agent, and OpenClaw; anything that accepts a custom base URL + model name works. That gives you a fully local agentic coding loop — no API bills, no code leaving your machine.
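To see what an agent actually sends over that endpoint, here is a minimal sketch of an OpenAI-style tool-calling request and the matching response parsing. The `run_tests` tool is hypothetical and exists only for illustration, as are the helper names; the model name assumes the server was launched with the `vllm serve` command above, which by default exposes the model under the path it was given.

```python
import json

# Hypothetical tool for illustration; a real agent defines its own tools.
RUN_TESTS_TOOL = {
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return the output.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Test file or directory"}
            },
            "required": ["path"],
        },
    },
}


def build_tool_request(prompt, model="deepreinforce-ai/Ornith-1.0-9B"):
    """Body for POST /v1/chat/completions with one tool on offer."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [RUN_TESTS_TOOL],
        "tool_choice": "auto",
        "temperature": 0.6,
        "top_p": 0.95,
    }


def parse_tool_calls(response):
    """Extract (name, arguments) pairs from an OpenAI-style response dict."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # Arguments arrive as a JSON-encoded string, not a dict.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls
```

An empty list from `parse_tool_calls` means the model answered in plain text. If that happens on prompts that clearly need a tool, the server-side parser flags are the first thing to check.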
8. What benchmarks to actually expect
Set expectations honestly. These are DeepReinforce’s reported numbers:
| Model | SWE-Bench Verified | Terminal-Bench 2.1 | Notes |
|---|---|---|---|
| Ornith-1.0-9B | 69.4% | 43.1% (Terminus-2) / 40.6% (Claude Code harness) | Also 42.9% on SWE-Bench Pro |
| Ornith-1.0-397B | 82.4% | 77.5% | Flagship |
The headline is the 9B: 69.4% on SWE-Bench Verified is genuinely strong for a model this small — it’s the kind of score that was frontier-only not long ago, now running on a laptop. The 397B flagship at 82.4% is competitive with the best open models. Two honest caveats:
- These are the project’s own reported numbers; independent verification will follow as people reproduce them. The model card notes temperature 1.0 for benchmark reproduction, so day-to-day results at the recommended 0.6 will differ.
- A high SWE-Bench score reflects agentic problem-solving inside a harness, not raw single-shot completion quality. Ornith is built to be driven as an agent — that’s where it shines and where you should evaluate it.
If you mainly want a fast local coding completer rather than an agent, a same-size Qwen or Gemma is worth comparing — see our Qwen 3.6 local coding guide and Gemma 4 12B locally.
9. Troubleshooting
- `<think>` block showing up in your output. That’s expected, because Ornith is a reasoning model. Split on `</think>` and keep what follows as the final answer, or use a server (vLLM/SGLang) with the reasoning parser enabled so it’s separated for you.
- “Unknown architecture” on load. Your `transformers` is too old. Upgrade to ≥5.8.1 (`pip install -U "transformers>=5.8.1"`).
- Out of memory. You’re loading BF16 (~19 GB for the 9B) on too little VRAM. Switch to a Q4_K_M GGUF (~5.6 GB) via Ollama/llama.cpp, or do a 4-bit bitsandbytes load (`BitsAndBytesConfig(load_in_4bit=True)` passed as `quantization_config`).
- Tool calls aren’t firing. The parser flag is wrong or missing for your server version. Re-check the exact `--tool-call-parser` value on the model card against your vLLM/SGLang release. This is the single most common agent-integration snag.
- Repetitive or low-quality output. Confirm your sampling: temperature 0.6, top_p 0.95, top_k 20. The chat template is applied automatically by Ollama/llama.cpp/LM Studio; if you’re using raw transformers, make sure you’re applying it.
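Several of the items above come down to version mismatches, so a quick preflight check can save a debugging round. This is a minimal sketch using only the standard library: the minimum versions are the ones this guide cites from the model card, the helper names are illustrative, and the comparison deliberately ignores pre-release suffixes such as `.dev0` or `rc1`.

```python
import re
from importlib import metadata

# Minimum versions cited in this guide; packages you don't use will simply
# be reported as "not installed".
MINIMUM_VERSIONS = {"transformers": "5.8.1", "vllm": "0.19.1", "sglang": "0.5.9"}


def version_tuple(version):
    """Turn '5.8.1' or '5.8.1.dev0' into (5, 8, 1), ignoring suffixes."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if not match:
        return ()
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed, minimum):
    """True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_installed(minimums=MINIMUM_VERSIONS):
    """Return a {package: status} report for the current environment."""
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "not installed"
            continue
        if meets_minimum(installed, minimum):
            report[package] = f"{installed} (ok)"
        else:
            report[package] = f"{installed} (too old, need >= {minimum})"
    return report
```

Printing `check_installed()` inside the environment that runs your server or script shows at a glance whether an upgrade is the fix.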
The takeaway
Ornith-1.0 is a rare combination: a brand-new MIT-licensed open-weight coding family that’s specifically trained for agentic work, with the official team shipping GGUF so it runs on the stacks you already have. The standout is the 9B — 69.4% on SWE-Bench Verified, ~5.6 GB at Q4_K_M, running on a normal laptop. Start there with Ollama (ollama run hf.co/deepreinforce-ai/Ornith-1.0-9B-GGUF:Q4_K_M), reach for llama.cpp when you want control, use LM Studio if you prefer a GUI, the official transformers quickstart for full fidelity, and a vLLM/SGLang server to wire it into a coding agent for a fully local loop. Just remember it’s a reasoning model — handle the <think> block — and judge it as an agent, which is what it was built to be.
For other local setups, see Qwen 3.6 for local coding, Apertus locally, and Gemma 4 12B locally.
Sources
- deepreinforce-ai/Ornith-1.0-9B (Hugging Face model card): official transformers quickstart, recommended sampling (temp 0.6 / 1.0, top_p 0.95, top_k 20), 262,144 context, BF16 ~19 GB, transformers ≥5.8.1 / vLLM ≥0.19.1 / SGLang ≥0.5.9, `<think>` reasoning format, OpenHands/Hermes/OpenClaw integration
- deepreinforce-ai/Ornith-1.0-9B-GGUF (Hugging Face): official GGUF, exact quant file sizes, `ollama run` and `llama-cli -hf` / `llama-server -hf` commands
- deepreinforce-ai/Ornith-1.0-397B (Hugging Face): flagship MoE weights (FP8 build also published)
- Ornith-1.0: Self-Scaffolding LLMs for Agentic Coding (DeepReinforce blog): self-scaffolding RL method, family lineup (9B/31B dense, 35B/397B MoE), benchmark numbers (397B: 82.4 SWE-Bench Verified / 77.5 Terminal-Bench; 9B: 69.4 / 43.1), Gemma 4 + Qwen 3.5 base, MIT license
- DeepReinforce Releases Ornith-1.0 (MarkTechPost): independent coverage of the release and method
- Simon Willison, Ornith-1.0: third-party note, ~103 tok/s for the 35B build in LM Studio with agentic tool use
- VRAM figures in §2 are standard size-based estimates (the model card does not publish official requirements). Benchmark numbers are DeepReinforce’s own reported results, pending independent reproduction. Verified June 30, 2026 — confirm current details on the official model card before relying on them.