Jun 18, 2026

How to Use MiniMax M3: API, Pricing, Coding Agent & Local Setup

Last updated: June 18, 2026. Read time: ~9 minutes. What you’ll learn: what MiniMax M3 actually is (428B total / 23B active, MiniMax Sparse Attention, 1M context, native multimodality), how to read its self-reported benchmarks, the fastest way to call it via the OpenAI-compatible API (copy-paste curl / Python / Node), the real pay-as-you-go pricing, how to point a coding agent at it, and what running it locally genuinely costs in hardware.

Sourcing note: every benchmark below is MiniMax’s own self-reported figure, so verify on your own workload before trusting it. Every command, price, and model ID is taken from MiniMax’s official blog, the Hugging Face model card, the official API docs and pricing page, or Unsloth’s GGUF release (all listed under Sources at the bottom).

On June 1, 2026, MiniMax released M3 — and the pitch is specific: per their announcement, it’s “the first and only open-weight model” to bring frontier coding, a 1-million-token context window, and native multimodality together in one model. For builders, the interesting part isn’t the leaderboard. It’s that a model with these specs is open-weight and cheap to call: $0.30 per million input tokens on the standard tier.

This is a hands-on guide to actually using it: API first (because that’s how most of you will touch it), then coding-agent setup, then the local option for the few of you with workstation-class hardware.

1. What MiniMax M3 actually is

The specs that matter, from the Hugging Face model card:

428B total parameters, ~23B active per token. It’s a sparse Mixture-of-Experts. You pay 428B in disk/RAM but only ~23B in compute per token, so inference is far faster than the total size suggests.
MiniMax Sparse Attention (MSA) — a new attention architecture. This is the engine behind the long-context efficiency: MiniMax reports >9× prefill and >15× decode speedups vs M2 at 1M context, cutting per-token compute to 1/20 of the previous generation.
1M token context, native.
Natively multimodal — trained on text, image, and video input from the first step (not bolted on afterward).
License: minimax-community — note this is MiniMax’s own community license, not a standard permissive license like Apache 2.0 or MIT. If license terms matter for your use case, read it before you ship.

The one-line mental model: a 428B-class model that runs at 23B-active speed, reads a million tokens cheaply, and can be called for fractions of a cent per typical request.
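
To put rough numbers on that mental model, here is a back-of-the-envelope sketch in Python. The parameter counts are the model card figures quoted above; the "about 2 FLOPs per active parameter" rule of thumb and the 16-bit weight size are my own assumptions for illustration, not MiniMax's numbers.

```python
# Rough MoE arithmetic. Parameter counts are from the model card;
# the FLOPs rule of thumb and 16-bit weights are illustrative assumptions.
TOTAL_PARAMS = 428e9   # every expert has to be stored
ACTIVE_PARAMS = 23e9   # approximate parameters used per token


def active_fraction(total: float = TOTAL_PARAMS, active: float = ACTIVE_PARAMS) -> float:
    """Share of the weights that take part in computing one token."""
    return active / total


def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone (no KV cache), in GB of 1e9 bytes."""
    return params * bits_per_weight / 8 / 1e9


def forward_gflops_per_token(active: float = ACTIVE_PARAMS) -> float:
    """Common approximation: about 2 FLOPs per active parameter per token."""
    return 2 * active / 1e9


if __name__ == "__main__":
    print(f"active fraction: {active_fraction():.1%}")
    print(f"16-bit weights:  {weight_memory_gb(TOTAL_PARAMS, 16):.0f} GB")
    print(f"compute/token:   {forward_gflops_per_token():.0f} GFLOPs")
```

Under those assumptions, roughly one weight in twenty is touched per token, while the full set still has to sit in memory. That gap is why the API is the practical route for most people.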

2. The benchmarks (and how to read them)

These are MiniMax’s self-reported numbers from the official blog. Treat them as “competitive, verify on your stack” — not gospel.

Benchmark	MiniMax M3 (self-reported)
SWE-Bench Pro	59.0%
Terminal-Bench 2.1	66.0%
SWE-fficiency	34.8%
KernelBench (Hard)	28.8%
MCP Atlas	74.2%
PostTrainBench	0.37

A few honest caveats so you read these correctly:

MiniMax’s blog reports the SWE-Bench Pro 59.0% number without a head-to-head comparison table, so I’m not going to claim it “beats model X” — the official source doesn’t say that, and a lot of the secondary write-ups floating around invented comparison figures. If you see “M3 beats GPT-5.5 on SWE-Bench Pro,” check whether that’s in MiniMax’s own materials. It isn’t, as of this writing.
On PostTrainBench, MiniMax itself reports 0.37, slightly below Claude Opus 4.7’s 0.42 — i.e. it’s candid that the frontier closed models still lead on some agentic tasks.
The most concrete claim is a real demo, not a benchmark: MiniMax shows M3 optimizing a CUDA kernel from 7.6% to 71.3% of Hopper FP8 peak utilization (a 9.4× speedup), and reproducing a paper across “18 commits and 23 experimental figures.” Those are the kind of agentic, long-horizon tasks the 1M context is built for.

The takeaway for picking a model: M3’s story is “open-weight, frontier-ish coding, enormous cheap context.” If your workload is long-context agentic coding and cost-sensitive, that combination is the pitch. For the single hardest one-shot tasks, the closed frontier (Opus 4.8) still leads — see our Opus 4.8 guide.

3. The fastest way to use it: the API

M3’s API is OpenAI-compatible, so if you’ve ever called OpenAI, you already know the shape. Get a key from Account Management → API Keys on the MiniMax platform, then:

curl:

curl https://api.minimax.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "MiniMax-M3",
    "messages": [
      {"role": "user", "content": "Refactor this function to be O(n)..."}
    ],
    "max_completion_tokens": 500
  }'

Python (OpenAI SDK, just swap the base URL):

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.minimax.io/v1",
)

resp = client.chat.completions.create(
    model="MiniMax-M3",
    messages=[{"role": "user", "content": "Refactor this function to be O(n)..."}],
)
print(resp.choices[0].message.content)

Node.js:

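const OpenAI = require('openai');

const client = new OpenAI({
  apiKey: 'YOUR_API_KEY',
  baseURL: 'https://api.minimax.io/v1',
});

// Send the same request as the Python example and print the reply.
client.chat.completions.create({
  model: 'MiniMax-M3',
  messages: [{ role: 'user', content: 'Refactor this function to be O(n)...' }],
}).then((resp) => console.log(resp.choices[0].message.content));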

Three details that aren’t obvious from the boilerplate:

Model ID is exactly MiniMax-M3 (case-sensitive).
The thinking parameter. M3 has a reasoning mode controlled by a thinking block. Per the API docs, when you omit it, adaptive thinking is on by default (and responses include thinking content). Valid values: "adaptive" (default) or "disabled". For fast, terse code edits where you don’t want the reasoning trace, turn it off:
```
{ "thinking": { "type": "disabled" } }
```
Recommended sampling settings from the model card: temperature=1.0, top_p=0.95, top_k=40. Worth setting explicitly so your results don’t depend on whatever defaults your SDK or tool applies.
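
A minimal Python sketch of how those three details might fit together with the OpenAI SDK. The SDK has no named `thinking` or `top_k` arguments, so this passes them through `extra_body`, which merges extra fields into the request JSON. Whether MiniMax accepts `top_k` on this endpoint is an assumption to check against the API docs. The helper names and the `MINIMAX_API_KEY` environment variable are hypothetical.

```python
import os
from typing import Any

BASE_URL = "https://api.minimax.io/v1"


def build_request(prompt: str, *, thinking: bool = False) -> dict[str, Any]:
    """Keyword arguments for client.chat.completions.create()."""
    # Assumption: the endpoint accepts top_k as a top-level request field.
    extra_body: dict[str, Any] = {"top_k": 40}
    if not thinking:
        # Documented switch for turning the reasoning trace off.
        extra_body["thinking"] = {"type": "disabled"}
    # When thinking is wanted, the field is omitted so the adaptive default applies.
    return {
        "model": "MiniMax-M3",  # case-sensitive
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
        "extra_body": extra_body,
    }


def send(prompt: str, *, thinking: bool = False) -> str:
    """Call the API. Needs `pip install openai` and MINIMAX_API_KEY set."""
    from openai import OpenAI  # imported lazily so build_request works without the SDK

    client = OpenAI(api_key=os.environ["MINIMAX_API_KEY"], base_url=BASE_URL)
    resp = client.chat.completions.create(**build_request(prompt, thinking=thinking))
    return resp.choices[0].message.content
```

Calling `send("Rename foo to bar in this snippet: ...")` would request a terse edit with no reasoning trace; pass `thinking=True` for harder problems.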

4. Pricing — the part that actually matters

This is where M3 gets interesting for high-volume work. From the official pay-as-you-go pricing, standard tier:

Input length	Input / M tokens	Output / M tokens	Cached read / M
≤ 512K tokens	$0.30	$1.20	$0.06
> 512K tokens	$0.60	$2.40	$0.12

The prices above already include MiniMax’s 50% discount, which is permanent, not a launch promo. (Plenty of secondary articles called it a “7-day launch discount.” It isn’t, per MiniMax’s own page.)
There’s a Priority tier at 1.5× standard ($0.45 / $1.80 for ≤512K) if you need lower latency / higher reliability.
Prompt caching is cheap — $0.06/M to read cached ≤512K input. For agentic loops that re-send a big system prompt every turn, that’s where your bill actually lives, so caching matters more than the headline rate.
The >512K context tier is, per the docs, “available in limited quantity for a limited time,” with broader availability “expected in the next few days.” If you’re planning a workload that genuinely needs the full 1M window, confirm current availability before you architect around it.

Prefer a flat subscription? MiniMax also sells Token Plans — Plus $20/mo, Max $50/mo, Ultra $120/mo — and bundles M3 into MiniMax Code, their agent product (more on that below).

Cost framing: at $0.30/M input, a 200K-token codebase context costs about 6 cents to read once, and prompt caching drops repeat reads to ~1.2 cents. For cost-sensitive agentic coding, that’s the whole reason to look at M3 over a pricier frontier API. Compare against DeepSeek V4 Pro’s price-cut agent math if you’re optimizing spend.
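
The arithmetic behind that framing can be sketched as a small Python estimator. The rates are copied from the pricing table above. Three things are my assumptions rather than documented behavior: that "512K" means 512,000 tokens, that the tier is chosen by total input length, and that the Priority multiplier applies to all three rates.

```python
# Standard-tier rates in USD per million tokens, copied from the pricing table.
RATES = {
    "short": {"input": 0.30, "output": 1.20, "cached": 0.06},  # input <= 512K tokens
    "long": {"input": 0.60, "output": 2.40, "cached": 0.12},   # input > 512K tokens
}
TIER_BOUNDARY = 512_000      # assumption: "512K" could also mean 524,288
PRIORITY_MULTIPLIER = 1.5    # assumption: applied to every rate, cached reads included


def request_cost(
    input_tokens: int,
    output_tokens: int = 0,
    cached_tokens: int = 0,
    priority: bool = False,
) -> float:
    """Estimated USD cost of one request.

    cached_tokens is the part of input_tokens served from the prompt cache.
    """
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    tier = RATES["short" if input_tokens <= TIER_BOUNDARY else "long"]
    fresh_tokens = input_tokens - cached_tokens
    cost = (
        fresh_tokens * tier["input"]
        + cached_tokens * tier["cached"]
        + output_tokens * tier["output"]
    ) / 1_000_000
    return cost * PRIORITY_MULTIPLIER if priority else cost


if __name__ == "__main__":
    print(f"200K read once:   ${request_cost(200_000):.3f}")
    print(f"200K fully cached: ${request_cost(200_000, cached_tokens=200_000):.3f}")
```

This reproduces the 6 cent and 1.2 cent figures for a 200K-token context. Treat the result as an estimate and check it against your actual invoice.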

5. Wire it into a coding agent

Because the endpoint is OpenAI-compatible, any coding tool that lets you set a custom base URL and model can use M3. The recipe is always the same:

Base URL: https://api.minimax.io/v1
API key: your MiniMax key
Model: MiniMax-M3

That covers editors and agents that accept an “OpenAI-compatible” / “custom provider” config (Cline, Roo, Continue, and most CLI agents). Set those three fields, select the model, and you’re running M3 as your backend.

Two levers worth tuning for agent work:

Turn thinking off for tool-call-heavy loops ("thinking": {"type": "disabled"}) when you want fast, terse edits and don’t need the reasoning trace eating output tokens. Leave it on adaptive (the default) for hard, multi-step refactors.
Lean on prompt caching. Agents re-send the same system prompt + file context every turn; at $0.06/M cached read, keeping your context stable between turns is the single biggest cost lever.
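
To see why the caching lever dominates, here is a rough Python model of the input-side bill for a multi-turn agent loop. It uses the ≤512K standard rates quoted earlier and makes simplifying assumptions of my own: the shared prefix never changes, it is fully cached from the second turn on, there is no separate cache-write charge, and history growth is ignored. The function name is hypothetical.

```python
INPUT_RATE = 0.30 / 1_000_000   # USD per fresh input token (<=512K tier)
CACHED_RATE = 0.06 / 1_000_000  # USD per cached input token (<=512K tier)


def loop_input_cost(
    prefix_tokens: int,
    new_tokens_per_turn: int,
    turns: int,
    cached: bool,
) -> float:
    """Input-side cost of an agent loop that re-sends a stable prefix each turn."""
    total = 0.0
    for turn in range(turns):
        # The first turn always pays the full rate; later turns can hit the cache.
        prefix_rate = CACHED_RATE if (cached and turn > 0) else INPUT_RATE
        total += prefix_tokens * prefix_rate + new_tokens_per_turn * INPUT_RATE
    return total


if __name__ == "__main__":
    for cached in (False, True):
        cost = loop_input_cost(200_000, 2_000, turns=50, cached=cached)
        print(f"cached={cached}: ${cost:.2f}")
```

For a 200K-token prefix over 50 turns, this model gives about $3.03 without caching and about $0.68 with it. Anything that reorders or rewrites the prefix between turns pushes you back toward the higher figure.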

If you’re building your own harness rather than using an off-the-shelf agent, the same OpenAI-compatible contract applies — see how we wired a local OpenAI endpoint into a coding agent in the Qwen3.6 local-coding guide; the client code is identical, you’re just pointing base_url at MiniMax instead of localhost.

6. MiniMax Code (the official agent app)

If you don’t want to assemble your own setup, MiniMax ships MiniMax Code, a desktop agent app, at agent.minimaxi.com/download. It bundles M3 as the backend model and is the zero-config path — useful if you just want to try the model’s agentic behavior before committing to API integration. It’s covered by the Token Plans rather than metered per-token.

7. Running it locally (only if you really mean it)

Let’s be blunt: M3 is not a “run it on your MacBook” model. It’s 428B total parameters, and even aggressively quantized it needs workstation- or server-class memory. From Unsloth’s GGUF release:

Quant	RAM / VRAM needed
1-bit (UD-IQ1_M)	128 GB
2-bit (UD-IQ2_XXS)	134 GB
2-bit (UD-Q2_K_XL)	143 GB
3-bit (UD-IQ3_XXS)	159 GB
4-bit (UD-IQ4_XS)	208 GB
4-bit (UD-Q4_K_M)	264 GB
8-bit (Q8_0)	453 GB

Even the 1-bit quant wants 128 GB. That’s 512GB Mac Studio or multi-GPU territory, not consumer hardware. Two more things to know before you try:

llama.cpp support is preliminary and not in a released build yet. You have to compile from a specific PR:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/24523/head:minimax-m3
git checkout minimax-m3
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j --target llama-cli llama-server
./build/bin/llama-cli -hf unsloth/MiniMax-M3-GGUF:UD-IQ1_M

MSA isn’t supported in llama.cpp yet — inference “falls back to dense attention,” so you lose the long-context efficiency that’s M3’s whole point. For now, local llama.cpp gives you the weights but not the speed story.
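
If you do build that branch, llama-server exposes an OpenAI-compatible chat endpoint, so a local call looks much like the hosted one. The Python sketch below uses only the standard library and rests on several assumptions: that the server from the PR build is running on its default address (localhost:8080), that it serves M3 the same way llama-cli does, and that the `model` field is just a label for a single-model server. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: llama-server is listening on its default host and port.
LOCAL_URL = "http://localhost:8080/v1/chat/completions"


def build_local_request(prompt: str, max_tokens: int = 500) -> urllib.request.Request:
    """Build (but do not send) a chat request for a local llama-server."""
    payload = {
        "model": "MiniMax-M3",  # assumption: treated as a label by a single-model server
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        LOCAL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local(prompt: str) -> str:
    """Send the request; only works while llama-server is running."""
    with urllib.request.urlopen(build_local_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server started (for example `./build/bin/llama-server -hf unsloth/MiniMax-M3-GGUF:UD-IQ1_M`, assuming the PR branch supports it), `ask_local("Refactor this function to be O(n)...")` would return the reply text. Expect it to be slow at long context, since the dense-attention fallback applies here too.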

If you’re running on proper inference infrastructure, vLLM and SGLang were supported from day one of the weight release — that’s the route for anyone actually self-hosting M3 in production. For most people, though: the API is cheaper than the electricity, and the API gets you MSA.
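
Going back to the Unsloth table, a quick way to sanity-check your own machine is to look up the largest quant whose listed requirement fits your memory. The Python sketch below copies the figures from that table; the `headroom_gb` knob (space reserved for the OS, KV cache, and other processes) is a hypothetical parameter of mine, not something Unsloth specifies.

```python
from typing import Optional

# (quant name, RAM/VRAM needed in GB) copied from the table above, smallest first.
QUANTS = [
    ("UD-IQ1_M", 128),
    ("UD-IQ2_XXS", 134),
    ("UD-Q2_K_XL", 143),
    ("UD-IQ3_XXS", 159),
    ("UD-IQ4_XS", 208),
    ("UD-Q4_K_M", 264),
    ("Q8_0", 453),
]


def largest_fitting_quant(memory_gb: float, headroom_gb: float = 0.0) -> Optional[str]:
    """Return the biggest quant whose listed requirement fits, or None."""
    budget = memory_gb - headroom_gb
    fitting = [name for name, need_gb in QUANTS if need_gb <= budget]
    return fitting[-1] if fitting else None


if __name__ == "__main__":
    for memory in (64, 128, 192, 512):
        print(f"{memory} GB -> {largest_fitting_quant(memory)}")
```

A 64 GB machine gets `None`, which is the point of this section: nothing in the table fits below 128 GB.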

8. MiniMax M3 vs the open-weight coding field

M3 isn’t alone in the “open-weight, frontier-ish coding model” space in 2026. Quick orientation:

vs Qwen3.6-35B-A3B — Qwen3.6 is the one that actually runs on consumer hardware (23GB at 4-bit on a 32GB Mac). M3 is far more capable on paper but an order of magnitude heavier to self-host. Qwen3.6 for “local on my machine,” M3 for “cheap frontier-ish via API.”
vs Kimi K2.7 Code — both are giant open MoE coding models; Kimi K2.7 is a 1T open coding agent. If you’re comparison-shopping the open frontier, these are the two to benchmark on your tasks.
vs DeepSeek V4 Pro — DeepSeek’s story is the aggressive price cut and cache economics. M3’s edge is the native 1M multimodal context. Different bets; both cheap.
vs closed frontier (Opus 4.8) — MiniMax’s own PostTrainBench number (0.37 vs Opus 4.7’s 0.42) is the honest tell: on the hardest agentic work, the closed frontier still leads. See the Opus 4.8 guide.

9. When MiniMax M3 wins — and when it doesn’t

Reach for M3 when:

You have long-context, cost-sensitive agentic coding — huge codebases, long agent loops — and the $0.30/M + cheap caching makes the economics work.
You need multimodality (image/video input) in the same model as coding, without juggling two APIs.
You want open weights for compliance/portability reasons and have the infrastructure (vLLM/SGLang) to self-host.

Don’t, when:

You want to run locally on a normal machine — get Qwen3.6 instead; M3’s 128GB floor rules out consumer hardware.
You’re chasing the absolute top score on the hardest single tasks — the closed frontier still edges it, by MiniMax’s own admission.
License terms matter and minimax-community doesn’t fit — read it before building on it.

The takeaway

MiniMax M3 matters less for any one benchmark and more for the bundle: open-weight, frontier-ish coding, native multimodality, and a genuinely cheap 1M-token context at $0.30/M input. For most builders the move is simple — point the OpenAI SDK at https://api.minimax.io/v1, set the model to MiniMax-M3, turn thinking off for fast edits, and lean on prompt caching to keep the bill down. Self-hosting is real but workstation-class; the API is where the value is.

For the local-first alternative that runs on hardware you already own, see Run Qwen3.6 Locally for Coding. For the hosted-frontier comparison, see How to Use Qwen3.7-Max and the Opus 4.8 guide.

Sources

MiniMax M3 — official announcement blog — release date, architecture, MSA, self-reported benchmarks, speedup claims
MiniMax-M3 model card on Hugging Face — parameters (428B/23B active), license, recommended inference settings, supported engines
MiniMax Chat Completions API docs (OpenAI-compatible) — base URL, model ID, curl/SDK examples, thinking parameter
MiniMax pay-as-you-go pricing — per-token rates, tiers, caching, permanent 50% discount
unsloth/MiniMax-M3-GGUF — quantization memory requirements, llama.cpp build status
Specs and figures verified June 18, 2026; all benchmarks are MiniMax’s own self-reported numbers — verify on your workload.