Three ways to make Qwen3 stop thinking out loud

Qwen3 (and most modern reasoning-flavoured open models) loves to emit <think>…</think> blocks before producing the answer you actually wanted. That’s great if you’re a researcher. It’s expensive and slow if you’re using the model as a tool. There are three places you can put the off-switch (in the prompt, in the API parameters, or in the Modelfile), and only one of them survives every transport. This post covers which one, and why the others looked like they worked but didn’t.

The thing that ate my latency budget

I dropped a 4B Qwen3 model into a pipeline that asked it for short, structured extractions (“give me a 60-word summary of this chunk with the three key points and any action items”). The reference model I’d been using produced answers in roughly half a second. Qwen3 was producing answers in eight to fifteen seconds, which is sixteen to thirty times slower. The answers were better (Qwen3 is genuinely a smarter model), but they were not sixteen-times-better, and the pipeline downstream had a budget.

When I turned on raw API logging, the reason was obvious:

<think>
Okay, the user wants a 60-word summary of this chunk. Let me first read
through and identify the main themes. The speaker is discussing... [continues
for 800 tokens] ... so I should structure the output with the summary, then
the key points, then actions. Let me draft that now.
</think>

SUMMARY: ...
POINTS: ...
ACTIONS: ...

800 tokens of internal monologue, then the answer I’d asked for. I was paying — in latency and (if the model were hosted) in dollars — for the model to convince itself out loud, every single call.
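To put a number on that overhead, it helps to split each raw completion into its thinking and its answer. The sketch below is a minimal version of that, assuming the output uses literal <think>…</think> tags as in the log above. The function names are mine, and counting whitespace-separated words is only a rough stand-in for real token counts.

```python
import re

# Matches one <think>...</think> block, including any newlines inside it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(raw: str) -> tuple[str, str]:
    """Return (thinking, answer) extracted from a raw completion string."""
    thinking = "\n".join(block.strip() for block in THINK_RE.findall(raw))
    answer = THINK_RE.sub("", raw).strip()
    return thinking, answer


def thinking_share(raw: str) -> float:
    """Rough fraction of the output spent on thinking.

    Uses whitespace-split word counts as a proxy for tokens (an assumption;
    a real tokenizer would give different absolute numbers).
    """
    thinking, answer = split_thinking(raw)
    t, a = len(thinking.split()), len(answer.split())
    return t / (t + a) if (t + a) else 0.0


# A shortened stand-in for the logged response above.
raw = (
    "<think>\nOkay, the user wants a summary. Let me draft that now.\n</think>\n\n"
    "SUMMARY: ...\nPOINTS: ...\nACTIONS: ..."
)
thinking, answer = split_thinking(raw)
print(answer)
print(f"{thinking_share(raw):.0%} of the words were thinking")
```

Run over a day of logged responses, the share makes it easy to see whether the monologue is an occasional cost or the bulk of every call.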

There are three places you can tell Qwen3 to skip the thinking. They are not equivalent.

Option 1 — Prompt-level: `/no_think`

The first thing you’ll find when you search for this is a recommendation to add /no_think to your system prompt. It’s the cheapest fix to try and the easiest to roll back.

What it does, on paper: the model sees the directive and skips the <think> block.

What it actually does, in practice: sometimes. The behaviour depends on the runtime, the chat template, and what other text is around the directive in the system prompt. Open WebUI is particularly prone to a delightful failure mode where the model sees the literal string /no_think in the prompt and then thinks about how not to think — which is exactly the outcome you were trying to avoid, plus one layer of philosophical irony.

SYSTEM: /no_think

USER: Summarise this paragraph in 30 words.

ASSISTANT:
<think>
The user wants me to summarise without thinking. Let me consider how to
approach this. They've used the /no_think directive, which means I should
not show my reasoning. But to write a good summary I need to identify the
main points, which is itself a kind of thinking. Let me...
</think>

That’s a real response from a real session. The directive existed; the model noticed; the model thought about the directive. Net thinking budget: identical or worse.

Verdict: worth trying for a one-off. Not worth relying on for a production pipeline.
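If you do try the directive, check the raw completion rather than trusting the UI. The sketch below builds the message list with /no_think as the system prompt and flags any response that still contains real thinking. It assumes literal <think> tags in the raw text, and it treats an empty <think></think> pair as a success, since some setups emit an empty block when thinking is off (an assumption worth confirming on yours). Both function names are hypothetical.

```python
import re


def build_messages(user_text: str, system_text: str = "/no_think") -> list[dict]:
    """Chat messages with the directive as the system prompt."""
    return [
        {"role": "system", "content": system_text},
        {"role": "user", "content": user_text},
    ]


def thinking_leaked(raw: str) -> bool:
    """True if the raw completion contains a non-empty <think> block.

    An unclosed <think> tag (for example a truncated response) also counts.
    """
    closed_blocks = re.findall(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if any(block.strip() for block in closed_blocks):
        return True
    # Remove the closed blocks, then look for a dangling opening tag.
    remainder = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    return "<think>" in remainder


messages = build_messages("Summarise this paragraph in 30 words.")
print(messages[0])
print(thinking_leaked("<think>\n\n</think>\n\nA short summary."))
print(thinking_leaked("<think>The user wants me to summarise...</think>"))
```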

Option 2 — Parameter-level: the `think` flag

The cleaner fix, available in some clients but not all, is to pass a parameter that tells the runtime itself to suppress thinking — bypassing the model’s interpretation of the prompt.

In Open WebUI, this looks like:

+ Add Custom Parameter
    Key:   think
    Value: false

Behind the scenes, this sends a raw API flag to Ollama (or whichever backend is running the model) that disables the thinking output at the protocol level, not the prompt level.

What it does: strips <think> from the output stream.

What it doesn’t guarantee: that the model stops generating the thinking. Depending on the runtime version, the flag may control only visibility, not behaviour, so you may still be paying for the tokens even though they don’t appear in your final output. The generated-token count in the raw API response tells you which case you’re in.

Caveat: if you also have /no_think in your system prompt, the two can conflict. The parameter wins on output, but the prompt directive still confuses the model’s planning. Pick one and only one. (Spoiler: the parameter is the one.)

Verdict: good for client-side fixes. Reliable for what you can see; not reliable for what you’re paying for.
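Outside Open WebUI, the same flag can be sent straight to the backend, and the response metadata shows what was actually generated. The sketch below assumes a local Ollama server on its default port, a chat endpoint that accepts a top-level think field, and a response that reports generated tokens as eval_count. Those names reflect my understanding of the Ollama API, so check them against the version you run. The helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumption: default local address


def build_chat_request(model: str, user_text: str, think: bool = False) -> dict:
    """JSON body for a non-streaming chat call with the think flag set."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "think": think,
        "stream": False,
    }


def post_chat(payload: dict, url: str = OLLAMA_URL) -> dict:
    """Send the request and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def generation_report(response: dict) -> dict:
    """Pull out what was generated, not just what was displayed."""
    message = response.get("message", {})
    return {
        "generated_tokens": response.get("eval_count"),
        "thinking_chars": len(message.get("thinking") or ""),
        "answer_chars": len(message.get("content") or ""),
    }


payload = build_chat_request("qwen3:4b", "Summarise this paragraph in 30 words.")
# A hand-written stand-in for a server response, so this runs without a server.
fake_response = {"message": {"content": "SUMMARY: ...", "thinking": ""}, "eval_count": 42}
print(generation_report(fake_response))
```

With a server running, pass the payload to post_chat and compare generated_tokens against the length of the answer. If the count is far larger than the visible answer could account for, the flag is hiding the thinking rather than preventing it.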

Option 3 — Modelfile-level: the permanent fix

The actually-correct fix, and the one I now use for production: build a custom Modelfile that removes the thinking template from the model itself.

# 1. Pull the model's current Modelfile
ollama show qwen3:4b --modelfile > Modelfile

# 2. Edit it. In particular:
#    - Remove or replace any TEMPLATE lines that emit <think> blocks
#    - Add a hard SYSTEM directive
nano Modelfile

The edited SYSTEM line:

SYSTEM """You are a helpful assistant. Do not generate thinking
processes or <think> tags. Answer the user directly with the
requested output format."""

The TEMPLATE change is more subtle but matters more: most thinking-capable Qwen variants have a template line that explicitly inserts <think> before the model’s response. Remove that line entirely.
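Before deleting anything, it is worth locating every mention of the tag in the dumped Modelfile, because templates differ between variants and a careless delete can unbalance the template’s conditionals. The helper below is a small sketch that only reports line numbers and leaves the edit to you. The function name is mine, and the sample Modelfile text is a made-up fragment, not the real qwen3:4b template.

```python
def find_think_lines(modelfile_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for every line that mentions a think tag."""
    hits = []
    for number, line in enumerate(modelfile_text.splitlines(), start=1):
        if "<think>" in line or "</think>" in line:
            hits.append((number, line.rstrip()))
    return hits


# Hypothetical fragment for illustration only.
sample = '''FROM qwen3:4b
TEMPLATE """{{ .Prompt }}
<think>
"""
'''

for number, line in find_think_lines(sample):
    print(f"line {number}: {line}")
```

In practice you would read the real file with open("Modelfile").read() and review each reported line by hand before rebuilding.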

# 3. Build the customised model
ollama create qwen3-nothink:4b -f Modelfile

# 4. Use it instead of the base model
ollama run qwen3-nothink:4b

What it does: the model never produces a <think> block. The behaviour is at the model definition layer, so it survives prompts, parameters, and clients.

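What it costs: a little disk (Ollama can usually share the weight blobs between the two tags, so the new tag mostly adds the template and system layers; check with ollama list if space is tight) and discipline (you have to use the right tag in your runtime config).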

Verdict: the only one of the three that I trust in a pipeline that has a real latency or cost budget.

The three options, on one table

Option                    Where it lives    Reliable?                  Survives transport?  When to use
`/no_think` in prompt     The chat          No                         No                   Quick ad-hoc test
`think: false` parameter  The client        Output yes, generation no  Mostly               Interactive use in Open WebUI
Custom Modelfile          The model itself  Yes                        Yes                  Production pipelines

What’s actually happening under the hood

It helps to remember that “thinking” in these models is learned behaviour from training, not a flag the model checks at inference time. The model is trained on examples that look like <think>…</think> followed by the answer, and at inference it reproduces that shape because that’s what the training data taught it to do.

A prompt directive tries to bias the model away from that pattern using text input — the same channel as the user’s question. The model treats the directive as one more thing to consider, and considering things, of course, requires thinking. The recursive failure mode is built into the medium.

A runtime parameter intercepts the output stream and hides the thinking, but it can’t undo what the model is generating internally. You save the bandwidth of seeing the tokens; you don’t save the compute that produced them.

A Modelfile change rewrites the template the runtime uses to turn the conversation into the prompt the model sees, and in some cases (depending on how the base model was trained) actually changes the model’s behaviour at the token level, because the absence of <think> in the expected shape biases the model toward not producing one.

                    PROMPT                PARAMETER             MODELFILE
                        │                     │                     │
                        ▼                     ▼                     ▼
                  ┌──────────┐          ┌──────────┐         ┌──────────┐
                  │ "stop"   │          │ runtime  │         │ template │
                  │ in text  │          │ filter   │         │ rewrite  │
                  └────┬─────┘          └────┬─────┘         └────┬─────┘
                       │                     │                    │
              model still thinks    model still thinks    model doesn't think
              user sees thinking    user doesn't see       (no <think> emitted)
              user pays for it      user pays for it      user pays for the answer only

The Modelfile fix is the only one of the three that addresses behaviour rather than presentation. Everything else is downstream of “the model has decided to think.”
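A back-of-the-envelope calculation shows why hidden thinking still hurts. The sketch below takes the 800 thinking tokens from the log earlier in this post and combines them with two numbers I am assuming purely for illustration: a 100-token answer and a decode speed of 80 tokens per second. Measure your own hardware before trusting either.

```python
def decode_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time spent generating `tokens` at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


THINKING_TOKENS = 800     # from the logged example in this post
ANSWER_TOKENS = 100       # assumption: rough size of a short structured answer
TOKENS_PER_SECOND = 80.0  # assumption: illustrative decode speed

thinking_time = decode_seconds(THINKING_TOKENS, TOKENS_PER_SECOND)
answer_time = decode_seconds(ANSWER_TOKENS, TOKENS_PER_SECOND)

print(f"thinking: {thinking_time:.2f}s, answer: {answer_time:.2f}s")
print(f"thinking share of decode time: {thinking_time / (thinking_time + answer_time):.0%}")
```

Under those assumptions the thinking takes 10 seconds and the answer 1.25, which is in the same range as the eight to fifteen seconds I was seeing. Hiding the block changes what you read, not how long you wait.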

The pragmatic recommendation

If you’re prototyping, drop /no_think into your system prompt and move on — but verify with raw API output, not the rendered UI, that it actually worked. (It often hasn’t.)

If you’re using Open WebUI interactively, set think: false as a custom parameter. Cleaner. More reliable.

If you’re shipping anything to production where latency or token budget matters, build a custom Modelfile. It’s a five-minute exercise and it eliminates the entire class of failure. Tag it clearly (qwen3-nothink:4b), document why you did it, and move on with your life.
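Whichever switch you pick, a small regression check keeps it honest after upgrades. The sketch below takes any function that maps a prompt to the raw completion text and returns the prompts whose output still contains a think tag. The generate callable is a placeholder for however your pipeline calls the model, and the fake generator here exists only so the example runs on its own.

```python
from typing import Callable, Iterable


def check_no_thinking(generate: Callable[[str], str], prompts: Iterable[str]) -> list[str]:
    """Return the prompts whose raw output still contains a think tag."""
    failures = []
    for prompt in prompts:
        raw = generate(prompt)
        if "<think>" in raw or "</think>" in raw:
            failures.append(prompt)
    return failures


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call: 'thinks' only on prompts containing 'hard'."""
    if "hard" in prompt:
        return "<think>Let me consider...</think>\nANSWER: ..."
    return "ANSWER: ..."


failures = check_no_thinking(fake_generate, ["easy question", "hard question"])
print(failures)
```

Wire the real model call in place of fake_generate, run it against a handful of representative prompts, and fail the deploy if the list is not empty.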

The model is great. The thinking is great. They’re just not great together for every job. Pick the one you actually need, and turn the other off in the place it can’t sneak back in.
