NuExtract 3 released — a 4B open-weight VLM that beats 9B Qwen at structured extraction (Apache 2.0)
NuMind shipped NuExtract 3, a 4B vision-language model fine-tuned on Qwen3.5-4B for structured extraction + image-to-Markdown. Apache 2.0, 131K context, beats Qwen3.5-9B on NuMind's internal 600-doc benchmark. The best open-weight document-understanding model that fits on a single consumer GPU.
NuMind released NuExtract 3 today — a 4B open-weight vision-language model purpose-built for document understanding: structured JSON extraction from text or images, plus image-to-Markdown conversion. It’s already trending on r/LocalLLaMA, and the benchmark numbers explain why.
What it actually is
- Size: 4B parameters, fine-tuned on Qwen3.5-4B base
- License: Apache 2.0 (commercial use allowed, no strings)
- Context: 131,072 tokens
- Inputs: text, images, or text + images (multimodal, multilingual)
- Modes: reasoning and non-reasoning (toggle with enable_thinking)
- Two main pipelines:
  - Structured extraction → text/image + JSON template + instructions → JSON output
  - Image-to-Markdown → text/image → clean Markdown
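To make the structured-extraction pipeline concrete, here is a minimal sketch of the "JSON template in, JSON out" contract. The template fields, the type-hint strings, and the `matches_template` helper are all hypothetical illustrations, not NuExtract 3's documented template vocabulary; check the model card for the real format. The sketch only checks that a model response has the same nested keys as the template it was given.

```python
import json

# Hypothetical template: keys are the fields to extract, values are type hints.
# The hint strings are illustrative assumptions, not NuExtract 3's documented types.
TEMPLATE = {
    "invoice_number": "string",
    "total": "number",
    "line_items": [{"description": "string", "amount": "number"}],
}


def matches_template(output, template):
    """Return True if `output` has the same nested key structure as `template`."""
    if isinstance(template, dict):
        if not isinstance(output, dict) or set(output) != set(template):
            return False
        return all(matches_template(output[k], template[k]) for k in template)
    if isinstance(template, list):
        if not isinstance(output, list):
            return False
        if not template:  # an empty template list places no constraint on items
            return True
        # A one-element template list describes every item of the output list.
        return all(matches_template(item, template[0]) for item in output)
    # Leaf: the template holds a type hint; accept any scalar or null here.
    return output is None or isinstance(output, (str, int, float, bool))


# A made-up model response, used only to exercise the check.
raw = (
    '{"invoice_number": "INV-0042", "total": 118.5, '
    '"line_items": [{"description": "Widget", "amount": 118.5}]}'
)
print(matches_template(json.loads(raw), TEMPLATE))
```

A structural check like this is a cheap first gate in a pipeline: it catches missing or extra keys before any field-level validation runs.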
The benchmark that matters
On NuMind’s internal structured-extraction benchmark across ~600 diverse documents (invoices, movie posters, floor plans, contracts):
| Model | Score | Failures |
|---|---|---|
| NuExtract 3 (4B) | 0.651 ± 0.019 | 27 |
| gemma-4-E4B-it | 0.538 | 31 |
| Qwen3.5-9B | 0.479 | 170 |
| Qwen3.5-4B (base) | 0.417 | 229 |
| GLM-4.6V-Flash | 0.435 | 153 |
| Ministral-3-3B | 0.240 | 344 |
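The headline: NuExtract 3 (4B) beats Qwen3.5-9B by about 17 points (0.651 vs 0.479) on this task while being less than half the size, and it fails on 27 documents where Qwen3.5-9B fails on 170. The fine-tune is doing real work: it lifts the Qwen3.5-4B base from 0.417 to 0.651 on the same benchmark. This is what a domain-targeted post-training run looks like when you have ~600 hand-labeled documents from real customers.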
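The arithmetic behind those comparisons can be reproduced from the table. This is a small sketch that assumes exactly 600 documents (the article says "~600"), so the failure rates are approximate; the helper names are ours.

```python
# Scores and failure counts copied from the benchmark table above.
RESULTS = {
    "NuExtract 3 (4B)": (0.651, 27),
    "gemma-4-E4B-it": (0.538, 31),
    "Qwen3.5-9B": (0.479, 170),
    "GLM-4.6V-Flash": (0.435, 153),
    "Qwen3.5-4B (base)": (0.417, 229),
    "Ministral-3-3B": (0.240, 344),
}
N_DOCS = 600  # assumption: the source only says "~600", so rates are approximate


def score_gap(model_a, model_b):
    """Score difference between two models, rounded to three decimals."""
    return round(RESULTS[model_a][0] - RESULTS[model_b][0], 3)


def failure_rate(model):
    """Share of benchmark documents counted as failures for `model`."""
    return RESULTS[model][1] / N_DOCS


# Print the table sorted by score, with approximate failure rates.
for name, (score, failures) in sorted(RESULTS.items(), key=lambda kv: -kv[1][0]):
    print(f"{name:<20} score={score:.3f}  failures={failures:>3} (~{failures / N_DOCS:.1%})")

print("gap vs Qwen3.5-9B:        ", score_gap("NuExtract 3 (4B)", "Qwen3.5-9B"))
print("gap vs Qwen3.5-4B (base): ", score_gap("NuExtract 3 (4B)", "Qwen3.5-4B (base)"))
```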
Why builders should care
This is the first open-weight model in 2026 that I’d seriously consider deploying against GPT-4o / Claude / Gemini Flash for document extraction workloads:
- Self-hostable: 4B fits comfortably on a 12-16GB consumer GPU. vLLM deployment script is in the model card.
- Commercial license: Apache 2.0 means no procurement headaches.
- Real benchmark, not just synthetic: 600 documents including invoices and floor plans is the kind of long-tail mess that breaks general VLMs.
- Two-mode inference: reasoning mode for hard docs, non-reasoning for production throughput.
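A rough back-of-the-envelope check on the "fits on a 12-16GB GPU" claim, assuming 16-bit weights and exactly 4 billion parameters (the real count may differ). It covers weight memory only: the KV cache, activations, and runtime overhead come out of the remaining headroom, so treat it as a lower bound rather than a sizing guide.

```python
def weight_memory_gib(n_params, bytes_per_param):
    """Memory needed for the model weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


N_PARAMS = 4e9       # "4B" from the article; assumed exact for this estimate
BYTES_PER_PARAM = 2  # assumption: bf16/fp16 weights

weights = weight_memory_gib(N_PARAMS, BYTES_PER_PARAM)
print(f"weights: {weights:.2f} GiB")

# Headroom left for KV cache, activations, and runtime overhead,
# treating the advertised card size as GiB (an assumption).
for vram_gib in (12, 16):
    print(f"{vram_gib} GiB card: {vram_gib - weights:.2f} GiB headroom")
```

Long contexts eat into that headroom quickly, so the full 131,072-token window may not be usable on the smaller cards without limiting the maximum sequence length.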
For comparison-shopping the closed-model alternatives that would compete on the same workload, see our Gemini 3.5 Flash vs Claude Haiku 4.5 deep dive — these are the API-tier models NuExtract 3 is positioning itself to displace for self-hosted use cases.
What’s still unclear
NuMind didn’t publish a direct comparison against GPT-4o or Claude on the same 600-doc benchmark, which is the comparison that would actually settle the “use the API or self-host?” question. The image-to-Markdown evaluation was scored via Gemini 3 Flash as judge on 100 complex documents — useful, but not the same as side-by-side human eval against closed-model OCR.
Still, for any builder running document-extraction pipelines today on the OpenAI / Anthropic / Google APIs: standing up NuExtract 3 on your own hardware is now a 1-hour exercise instead of a 1-week project, which leaves the rest of a weekend for benchmarking it against your real document mix. That’s the actual news.
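A benchmark against your own document mix can start as small as the sketch below. Everything here is an assumption for illustration: `extract` stands in for whatever function calls your model (self-hosted or API), the metric is plain exact match on top-level fields rather than NuMind's scoring method, and the three-document dataset is made up.

```python
from typing import Callable


def field_accuracy(pred: dict, gold: dict) -> float:
    """Fraction of gold top-level fields whose predicted value matches exactly."""
    if not gold:
        return 1.0
    hits = sum(1 for key, value in gold.items() if pred.get(key) == value)
    return hits / len(gold)


def evaluate(extract: Callable[[str], dict], dataset: list) -> dict:
    """Run `extract` on every (document, gold) pair and aggregate the results."""
    scores, failures = [], 0
    for doc, gold in dataset:
        try:
            pred = extract(doc)
        except Exception:
            # Count crashes and unparseable output as failures with a zero score.
            failures += 1
            scores.append(0.0)
            continue
        scores.append(field_accuracy(pred, gold))
    mean = sum(scores) / len(scores) if scores else 0.0
    return {"mean_score": mean, "failures": failures}


def stub_extract(doc: str) -> dict:
    """Placeholder for a real model call; always answers with the same fields."""
    if "broken" in doc:
        raise ValueError("model returned invalid JSON")
    return {"vendor": "Acme", "total": 10.0}


DATASET = [
    ("invoice from Acme, total 10.00", {"vendor": "Acme", "total": 10.0}),
    ("invoice from Bolt, total 10.00", {"vendor": "Bolt", "total": 10.0}),
    ("broken scan", {"vendor": "Core", "total": 5.0}),
]

report = evaluate(stub_extract, DATASET)
print(report)
```

Swapping `stub_extract` for one wrapper per candidate model keeps the comparison fair: every model sees the same documents and is scored by the same function.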
Sources
- NuExtract 3 on Hugging Face — model card, benchmarks, deployment configs
- NuMind GitHub — numindai/nuextract
- NuExtract platform homepage
- r/LocalLLaMA discussion thread
Source: Hugging Face — numind/NuExtract3