How to Run NuExtract 3 Locally: vLLM, Templates & Document Extraction

Read time: ~9 minutes. What you’ll learn: how to stand up NuExtract 3 locally with either vLLM (an OpenAI-compatible server) or plain Transformers, the JSON template language that drives extraction, how to extract structured data from text, images, and multi-page PDFs, the document-to-Markdown mode, and when to reach for reasoning mode. Every command and code block is taken verbatim from NuMind’s official model card — copy-paste safe.

If you run any kind of document pipeline — invoices, receipts, contracts, forms — you’ve probably been paying GPT-4o or Gemini per page to turn images into JSON. NuExtract 3 is the release that makes self-hosting that workload realistic: a 4B open-weight vision-language model purpose-built for structured extraction that fits on a single consumer GPU, with an Apache 2.0 license and no per-token bill.

This is the hands-on guide to running it. For the benchmark story — why a 4B model beats Qwen3.5-9B at this task — see the NuExtract 3 release breakdown. Here we’re getting it running on your own hardware.


1. What you’re setting up

NuExtract 3 is a VLM fine-tuned on Qwen3.5-4B (4B parameters, BF16) with a 131,072-token context window. It does two things, both locally:

  • Structured extraction: feed it text or an image plus a JSON template, and it returns clean JSON matching that template.
  • Document-to-Markdown: feed it a document image with no template, and it returns Markdown (with HTML tables and LaTeX math preserved).

Both run through one model. The license is Apache 2.0, so commercial use is permitted as long as you follow the license’s attribution and notice terms.


2. Hardware you need

NuExtract 3 is a 4B model in BF16, so the weights are roughly 8–9GB. In practice:

  • Standard deployment comfortably fits a modern GPU with headroom for images — an RTX 4090 (24GB), A100, or H100.
  • Tight on VRAM? A 16GB card works if you cap context and image count (the low-memory command in §3 does exactly that).
  • BF16 needs a modern GPU — NuMind specifically lists A100 / H100 / RTX 4090 class. Older cards without BF16 support will struggle.

The memory cost isn’t just the weights — it’s the weights plus the image tokens. Each page-sized image expands into a lot of vision tokens, so processing 99 images at once needs far more memory than processing one. The low-memory recipe is mostly about capping that.
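The weights figure above is simple arithmetic: parameter count times bytes per parameter. The sketch below assumes exactly 4 billion parameters and 2 bytes per BF16 value. The helper name is illustrative, and the estimate deliberately leaves out the KV cache, image tokens, and runtime overhead, all of which come on top.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# Assumption: "4B" is taken as exactly 4e9 parameters; BF16 stores 2 bytes each.
weights_gb = weight_memory_gb(4e9)
print(f"weights only: ~{weights_gb:.1f} GB")

# Whatever remains on the card is the budget for context and image tokens.
for card_gb in (16, 24):
    remaining = card_gb - weights_gb
    print(f"{card_gb} GB card: ~{remaining:.1f} GB left for context and images")
```

The gap between the two cards is why the low-memory recipe caps context length and image count rather than touching the weights.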


3. Option A: Serve it with vLLM

vLLM gives you an OpenAI-compatible endpoint, which means any code or tool that already talks to the OpenAI API can point at NuExtract with a one-line base-URL change. This is the setup most people want.

Standard deployment:

vllm serve numind/NuExtract3 \
  --trust-remote-code \
  --limit-mm-per-prompt '{"image": 99, "video": 0}' \
  --chat-template-content-format openai \
  --generation-config vllm \
  --max-model-len 131072 \
  --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}'

Low-memory deployment (use this if the standard command OOMs — it caps context to 16K and images to 6 per request):

vllm serve numind/NuExtract3 \
  --trust-remote-code \
  --limit-mm-per-prompt '{"image": 6, "video": 0}' \
  --chat-template-content-format openai \
  --generation-config vllm \
  --max-model-len 16384 \
  --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}'

A few notes on the flags:

  • --trust-remote-code is required — the model ships custom modeling code.
  • --limit-mm-per-prompt sets how many images one request can carry. 99 is generous; drop it to save memory.
  • --speculative-config enables Multi-Token Prediction (MTP) speculative decoding for faster generation. Both commands above include it; if your vLLM build doesn’t support it, delete that line and the trailing backslash on the line before it.

Once it’s up, the endpoint lives at http://localhost:8000/v1 and speaks the OpenAI chat-completions protocol.
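Before sending extraction requests, it can help to confirm that the server is reachable and reports the model you expect. The following is a minimal standard-library sketch, assuming vLLM's default port and the model-listing route (GET /v1/models) that OpenAI-compatible servers expose. The function names are illustrative.

```python
import json
import urllib.error
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL from an OpenAI-style base URL."""
    return base_url.rstrip("/") + "/models"


def served_model_ids(base_url: str = "http://localhost:8000/v1", timeout: float = 5.0):
    """Return the model IDs the server reports, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply all count as "not ready".
        return None
    return [entry["id"] for entry in payload.get("data", [])]
```

Calling served_model_ids() once the server has finished loading should return a list that includes numind/NuExtract3; a None result usually means the server is still starting or is listening on a different port.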


4. The template language (this is the actual skill)

Everything in NuExtract revolves around the JSON template — a JSON object where the values describe the type of data you want, not example data. The model fills in the real values and returns JSON in the same shape.

Here’s a full invoice template showing the range:

{
  "invoice_number": "verbatim-string",
  "invoice_date": "date",
  "total_amount": "number",
  "currency": "currency",
  "line_items": [
    {
      "description": "verbatim-string",
      "item_type": ["electronics", "clothing", "vehicle", "furniture", "other"],
      "quantity": "integer",
      "unit_price": "number",
      "total": "number"
    }
  ]
}

The type vocabulary:

Template value                  | Meaning
"verbatim-string"               | Copy the text exactly as it appears
"string"                        | Paraphrased / abstracted text (model may reword)
"integer", "number"             | Numeric values
"date", "time", "date-time"     | Temporal values
["option1", "option2"]          | Enum: pick exactly one
[["A", "B", "C"]]               | Multi-select: pick any number
"currency", "country", "email"  | Specialized formats
["type"]                        | Array of that type (repeat the object)

The key distinction most people miss: verbatim-string vs string. Use verbatim-string for things like invoice numbers, names, and SKUs where you need the exact characters; use string when you want the model to clean up or summarize. Missing fields come back as null (or [] for arrays): the model is meant to leave a field empty rather than invent a value that isn’t in the document.
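Because the output is supposed to mirror the template, a small structural check can catch drift before it reaches downstream code. The sketch below is not part of NuExtract or NuMind's SDK. It is an illustrative helper (check_shape is a made-up name) that walks a template and a parsed result side by side, and it assumes the conventions from the table above: a one-element list holding an object or a type name is an array, a nested list is a multi-select, and any other list is an enum.

```python
# Scalar type names copied from the table above.
SCALAR_TYPES = {
    "verbatim-string", "string", "integer", "number", "date", "time",
    "date-time", "currency", "country", "email",
}


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_shape(template, value, path="$"):
    """Return a list of mismatches between a template and a parsed result."""
    if isinstance(template, dict):
        if value is None:  # a missing nested object may come back as null
            return []
        if not isinstance(value, dict):
            return [f"{path}: expected object"]
        errors = []
        for key, sub_template in template.items():
            if key not in value:
                errors.append(f"{path}.{key}: missing key")
            else:
                errors += check_shape(sub_template, value[key], f"{path}.{key}")
        return errors

    if isinstance(template, list):
        first = template[0] if template else None
        if len(template) == 1 and isinstance(first, list):  # [[...]] multi-select
            if not isinstance(value, list):
                return [f"{path}: expected list of options"]
            return [f"{path}: {v!r} is not an allowed option"
                    for v in value if v not in first]
        if len(template) == 1 and (isinstance(first, dict) or first in SCALAR_TYPES):
            if not isinstance(value, list):  # [{...}] or ["type"] array
                return [f"{path}: expected array"]
            errors = []
            for i, item in enumerate(value):
                errors += check_shape(first, item, f"{path}[{i}]")
            return errors
        # Anything else is treated as an enum: one listed option, or null.
        if value is not None and value not in template:
            return [f"{path}: {value!r} is not an allowed option"]
        return []

    # Scalar types: only the numeric ones are checked; null is always accepted.
    if value is None:
        return []
    if template == "integer" and not (isinstance(value, int) and not isinstance(value, bool)):
        return [f"{path}: expected integer"]
    if template == "number" and not _is_number(value):
        return [f"{path}: expected number"]
    return []
```

An empty list means the result has the same shape as the template. Date, currency, and other specialized formats are not validated here; that would need format-specific checks.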


5. Extract structured data from text

With the server running, this is a normal OpenAI call — the NuExtract-specific parts go in extra_body.chat_template_kwargs:

import json
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

template = {
    "store": "verbatim-string",
    "date": "date-time",
    "total": "number",
    "currency": ["USD", "EUR", "GBP", "JPY", "Other"],
    "items": [
        {
            "name": "verbatim-string",
            "price": "number"
        }
    ]
}

response = client.chat.completions.create(
    model="numind/NuExtract3",
    temperature=0.2,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Yesterday I bought apples and coffee at Trader Joe's for a total of $12.40."
                }
            ],
        }
    ],
    extra_body={
        "chat_template_kwargs": {
            "template": json.dumps(template),
            "instructions": "Specify the time for the `date` entry only if it is present, otherwise only output the date component.",
            "enable_thinking": False
        }
    }
)

print(response.choices[0].message.content)

Two things to notice: the template is passed as a JSON string (json.dumps(template)), and instructions is a free-text field where you can add per-field guidance — here, telling the model how to handle a missing time component.
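The reply arrives as a string, so the next step in most pipelines is to parse it and handle fields that came back empty. This is a minimal sketch with an illustrative helper name. The sample reply is a hypothetical string written for this example, not captured model output.

```python
import json


def parse_extraction(content):
    """Parse the model's reply into a dict, or return None if it is not a JSON object."""
    try:
        data = json.loads(content)
    except (TypeError, json.JSONDecodeError):
        # TypeError covers a reply of None; JSONDecodeError covers malformed text.
        return None
    return data if isinstance(data, dict) else None


# Hypothetical reply for illustration only; not real model output.
sample_reply = '{"store": "Trader Joe\'s", "date": null, "total": 12.4, "currency": "USD", "items": []}'

data = parse_extraction(sample_reply)
if data is not None:
    # Missing fields come back as null, which json.loads turns into None.
    date = data.get("date")
    print("date:", date if date is not None else "not found in the text")
    print("total:", data.get("total"))
```

In a real script you would pass response.choices[0].message.content to parse_extraction instead of the sample string.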


6. Extract from an image (receipts, screenshots)

The only change from text extraction is the message content: pass an image_url with a base64 data URL instead of text.

import json
import base64
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

image_base64 = encode_image("receipt.png")
data_url = f"data:image/png;base64,{image_base64}"

template = {
    "store": "verbatim-string",
    "date": "date-time",
    "total": "number",
    "payment_method": "verbatim-string"
}

response = client.chat.completions.create(
    model="numind/NuExtract3",
    temperature=0.2,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": data_url}
                }
            ],
        }
    ],
    extra_body={
        "chat_template_kwargs": {
            "template": json.dumps(template, indent=4),
            "enable_thinking": False
        }
    }
)

print(response.choices[0].message.content)

That’s the whole receipt-to-JSON pipeline. No OCR step, no layout parsing — the VLM reads the pixels directly.
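The example above hardcodes image/png in the data URL, which is fine for PNG files. If your inputs are a mix of PNG and JPEG, one option is to infer the MIME type from the file extension. The sketch below uses only the standard library; image_data_url is an illustrative name, not part of any SDK.

```python
import base64
import mimetypes
from pathlib import Path


def image_data_url(image_path: str) -> str:
    """Build a base64 data URL whose MIME type matches the file extension."""
    path = Path(image_path)
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type for {path.name!r}")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string drops into the same {"type": "image_url", "image_url": {"url": ...}} entry used above.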


7. Multi-page PDF extraction

Real documents are multi-page PDFs. The pattern: rasterize each page to a PNG with PyMuPDF, then pass all the pages as a list of images in one request. NuExtract reads across all of them and returns a single merged JSON.

import base64
import json
import fitz  # pip install pymupdf
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

def pdf_to_png_data_urls(pdf_path, dpi=170):
    data_urls = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi, alpha=False)
            png_bytes = pix.tobytes("png")
            png_base64 = base64.b64encode(png_bytes).decode("utf-8")
            data_urls.append(f"data:image/png;base64,{png_base64}")
    return data_urls

data_urls = pdf_to_png_data_urls("invoice.pdf", dpi=170)

template = {
    "invoice_number": "verbatim-string",
    "invoice_date": "date",
    "total": "number",
    "currency": "currency",
    "line_items": [
        {
            "description": "verbatim-string",
            "quantity": "number",
            "unit_price": "number",
            "total": "number"
        }
    ]
}

response = client.chat.completions.create(
    model="numind/NuExtract3",
    temperature=0.2,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": data_url}
                }
                for data_url in data_urls
            ],
        }
    ],
    extra_body={
        "chat_template_kwargs": {
            "template": json.dumps(template, indent=4),
            "enable_thinking": False
        }
    }
)

print(response.choices[0].message.content)

dpi=170 is a sane default — high enough to read small print, low enough to keep the image-token count manageable. If you’re on the low-memory deployment (6 images max), batch long PDFs in chunks of ≤6 pages.
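If you do have to split a long PDF, the batching itself is simple; combining the per-chunk results is the part that needs a decision. The sketch below shows one naive approach under stated assumptions: every chunk is extracted with the same template, scalar fields keep the first non-null value seen, and list fields are concatenated in page order. Both helper names are illustrative.

```python
def chunk_pages(data_urls, max_images=6):
    """Yield consecutive slices of at most max_images page images."""
    if max_images < 1:
        raise ValueError("max_images must be at least 1")
    for start in range(0, len(data_urls), max_images):
        yield data_urls[start:start + max_images]


def merge_results(results):
    """Naively merge per-chunk result dicts that share one template."""
    merged = {}
    for result in results:
        for key, value in result.items():
            if isinstance(value, list):
                # List fields (e.g. line items) are concatenated in page order.
                existing = merged.get(key)
                merged[key] = (existing if isinstance(existing, list) else []) + value
            elif merged.get(key) is None:
                # Scalar fields keep the first non-null value encountered.
                merged[key] = value
    return merged
```

Keep in mind that chunking removes cross-page context: a total that only appears on the last page is invisible to earlier chunks, and a table row split across a chunk boundary may be extracted twice or not at all. Where the hardware allows it, sending the whole document in one request avoids the problem.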


8. Document-to-Markdown mode

Drop the template entirely and set "mode": "markdown" to convert a document image into clean Markdown — useful for feeding scanned docs into a RAG pipeline or a Markdown-based knowledge base.

import base64
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

image_base64 = encode_image("document.png")
data_url = f"data:image/png;base64,{image_base64}"

response = client.chat.completions.create(
    model="numind/NuExtract3",
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": data_url}
                }
            ],
        }
    ],
    extra_body={
        "chat_template_kwargs": {
            "mode": "markdown",
            "enable_thinking": False
        }
    }
)

print(response.choices[0].message.content)

The output uses Markdown for text and headers, HTML for tables, LaTeX for math, and <figure> tags for images — so complex layouts survive the conversion instead of collapsing into a wall of text.
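For RAG ingestion it is often useful to index tables separately from running text. The sketch below pulls HTML tables out of a Markdown string. It assumes tables appear as literal <table>...</table> blocks and are not nested, and split_tables is an illustrative helper rather than anything NuExtract provides.

```python
import re

# Non-greedy match from an opening <table ...> to the nearest closing tag.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.DOTALL | re.IGNORECASE)


def split_tables(markdown: str):
    """Return (text_without_tables, tables) for Markdown containing HTML tables."""
    tables = TABLE_RE.findall(markdown)
    text = TABLE_RE.sub("", markdown)
    return text, tables
```

A regular expression is enough for flat tables; nested tables or malformed markup would call for a real HTML parser.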


9. Reasoning vs non-reasoning mode

NuExtract has a thinking toggle. The tradeoff is speed vs. handling hard layouts:

Mode                   | Settings                               | Use for
Non-thinking (default) | enable_thinking=False, temperature=0.2 | Fast, near-deterministic extraction; production throughput
Thinking               | enable_thinking=True, temperature=0.6  | Difficult documents, ambiguous fields, complex layouts

In thinking mode the model emits its reasoning inside <think>...</think> tags before the answer. Strip it like this:

result = response.choices[0].message.content

if "</think>" in result:
    # Take everything before the closing tag; tolerate a missing opening tag.
    reasoning = result.split("</think>")[0].replace("<think>", "").strip()
    answer = result.split("</think>")[-1].strip()
else:
    reasoning = None
    answer = result

print(answer)

Default to non-thinking for production — it’s faster and the temperature is low enough to be near-deterministic. Reach for thinking mode only when a specific document class keeps failing.
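One way to apply that advice per document rather than per document class is a two-step fallback: run non-thinking first and retry in thinking mode only when required fields come back empty. The sketch below is an illustration of that idea, not a NuMind-recommended procedure. call_model is a hypothetical callable you would write around client.chat.completions.create, and the settings are the ones from the table above.

```python
import json

# Settings copied from the table above.
NON_THINKING = {"enable_thinking": False, "temperature": 0.2}
THINKING = {"enable_thinking": True, "temperature": 0.6}


def strip_reasoning(content: str) -> str:
    """Drop everything up to and including the closing think tag, if present."""
    return content.split("</think>")[-1].strip()


def missing_fields(result: dict, required) -> list:
    """List required keys whose value is null, empty, or absent."""
    return [key for key in required if result.get(key) in (None, [], "")]


def extract_with_fallback(call_model, required):
    """Try non-thinking first; retry in thinking mode if required fields are empty.

    call_model is a hypothetical callable: call_model(enable_thinking=..., temperature=...)
    must send the request and return the raw reply string.
    Returns (result, used_thinking), or None if no reply could be parsed.
    """
    best = None
    for settings in (NON_THINKING, THINKING):
        raw = call_model(**settings)
        try:
            result = json.loads(strip_reasoning(raw))
        except json.JSONDecodeError:
            continue  # unparseable reply: move on to the next mode
        if not isinstance(result, dict):
            continue
        best = (result, settings["enable_thinking"])
        if not missing_fields(result, required):
            break
    return best
```

A null field can also mean the value is simply absent from the document, so treat the retry as a second opinion and log how often it changes the answer.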


10. Option B — Run it with Transformers (no server)

If you’d rather run inference inline in a script — no separate server process — use Transformers directly. Same model, same template language; you just call it as a function.

import json
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "numind/NuExtract3"

processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
).eval()

def run_nuextract(messages, **chat_template_kwargs):
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        **chat_template_kwargs,
    ).to(model.device)

    with torch.inference_mode():
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=4096,
            do_sample=False,
        )

    generated_ids = generated_ids[:, inputs.input_ids.shape[1]:]
    return processor.batch_decode(
        generated_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )[0].strip()

# Structured extraction
receipt_image = Image.open("receipt.png").convert("RGB")
receipt_messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": receipt_image,
            }
        ],
    }
]

template = {
    "store": "verbatim-string",
    "date": "date-time",
    "total": "number",
    "payment_method": "verbatim-string"
}

structured_output = run_nuextract(
    receipt_messages,
    template=json.dumps(template, indent=4),
    enable_thinking=False,
)
print(structured_output)

For Markdown mode with Transformers, pass mode="markdown" to run_nuextract instead of a template. Use vLLM for serving a pipeline; use Transformers for one-off scripts and experimentation.


11. Two shortcuts worth knowing

Generate a template from a sentence. You don’t have to hand-write templates. Ask the model in template-generation mode:

response = client.chat.completions.create(
    model="numind/NuExtract3",
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "I want to extract the key details from a rental contract."
                }
            ],
        }
    ],
    extra_body={
        "chat_template_kwargs": {
            "mode": "template-generation"
        }
    }
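)

# The reply is the generated template as text; review it before reusing it.
print(response.choices[0].message.content)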

Convert a Pydantic model into a template. If you already define your schemas as Pydantic models, NuMind’s SDK converts them:

from typing import Literal
from pydantic import Field, BaseModel
from numind.nuextract_utils import convert_json_schema_to_nuextract_template

class HotelBooking(BaseModel):
    city: str
    check_in_date: str = Field(description="date")
    check_out_date: str = Field(description="date")
    number_of_guests: int
    room_type: Literal["single", "double", "suite"]

template, dropped_branches = convert_json_schema_to_nuextract_template(
    HotelBooking.model_json_schema()
)
# {'check_in_date': 'date', 'check_out_date': 'date', 'city': 'string',
#  'number_of_guests': 'integer', 'room_type': ['single', 'double', 'suite']}

This is the cleanest path if extraction feeds into typed application code — define the Pydantic model once, derive the template from it, and your output JSON already matches your data classes.


12. Self-host or call an API?

The honest decision framework. Self-host NuExtract 3 when document volume is high enough that per-page API costs add up, when documents can’t leave your infrastructure (legal/medical/financial compliance), or when you want a fixed-cost pipeline instead of a metered one. The model is small enough that one consumer GPU handles real throughput.

Stay on an API when volume is low, when you need the absolute strongest reasoning on messy edge-case documents, or when you don’t want to run GPU infrastructure at all. NuMind didn’t publish a head-to-head against GPT-4o or Claude on the same benchmark, so for the hardest documents, test both on your document mix before committing.
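The volume question comes down to break-even arithmetic: how many pages it takes for per-page savings to cover the hardware. The sketch below makes that explicit. Every number in it is a placeholder chosen for illustration, not a real GPU or API price, so substitute your own quotes.

```python
import math


def break_even_pages(hardware_cost: float, api_cost_per_page: float,
                     local_cost_per_page: float = 0.0) -> int:
    """Pages needed before self-hosting becomes cheaper than a metered API."""
    saving_per_page = api_cost_per_page - local_cost_per_page
    if saving_per_page <= 0:
        raise ValueError("self-hosting never breaks even at these per-page costs")
    return math.ceil(hardware_cost / saving_per_page)


# Placeholder inputs, NOT real prices: replace with your own figures.
pages = break_even_pages(
    hardware_cost=2000.0,       # one-time GPU purchase
    api_cost_per_page=0.01,     # metered API price per page
    local_cost_per_page=0.002,  # electricity and upkeep per page
)
print(f"break-even after roughly {pages:,} pages")
```

If you already own the GPU, hardware_cost drops toward zero and the comparison is mostly about engineering time and compliance rather than price.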

If you’re weighing the closed-model side, our Gemini 3.5 Flash vs Claude Haiku 4.5 comparison covers the cheap-API tier NuExtract is positioned to displace. And if you’re building out a broader local stack, the local coding setup with llama.cpp walks through the quantization and VRAM math for running other open models on the same hardware.


The takeaway

NuExtract 3 turns document extraction from a metered API call into a local, fixed-cost pipeline you own. The whole workflow is: run vllm serve, write a JSON template describing what you want, and POST your text, images, or rasterized PDF pages to an OpenAI-compatible endpoint. For high-volume or compliance-bound document work, that’s the difference between a per-page bill that scales with usage and a fixed cost on a GPU you may already own. Start with the standard vLLM command, a simple template, and one receipt, and you should have structured JSON coming out in under an hour.

For the benchmark context behind the model, see the NuExtract 3 release breakdown.
