spent yesterday trying to run qwen 3.5 9b locally on my mac mini m4 for my @openclaw bot.
inference is fast (14 tok/s), tool calling works, ram is tight but manageable at 6.6gb.
one dealbreaker: ollama forces thinking ON for all qwen 3.5 models. 800 invisible <think> tokens before any output. 30-60 seconds of silence on every message, even "say hi."
what I tried:
→ think: false on /api/chat - works, but openclaw doesn't send it
→ reasoning_effort: "none" on /v1/ - disables thinking but breaks tool calling
→ node proxy to inject the param - works for raw calls, breaks streaming
→ system prompt tricks - qwen 3.5 ignores them
→ /nothink suffix - that's qwen 3, not 3.5
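for reference, a rough sketch of the proxy idea from the list above, in python rather than node. it only shows the two pieces that matter: adding think: false to the /api/chat request body when the client left it out, and passing the streamed reply through chunk by chunk instead of buffering it. the function names and the model tag in the usage note are made up for illustration, and this is not a confirmed fix for the streaming breakage, just the shape a proxy would need:

```python
import json


def inject_think_false(raw_body: bytes) -> bytes:
    """Return the request body with "think": false added if the client left it unset.

    Hypothetical helper: assumes the body is the JSON sent to /api/chat.
    """
    payload = json.loads(raw_body)
    # Only touch JSON objects, and leave an explicit client choice alone.
    if isinstance(payload, dict):
        payload.setdefault("think", False)
    return json.dumps(payload).encode("utf-8")


def relay_stream(upstream_chunks):
    """Yield upstream bytes as they arrive rather than collecting the full reply.

    Hypothetical helper: `upstream_chunks` is any iterable of bytes, e.g. the
    chunks read from the upstream response. Empty chunks are skipped.
    """
    for chunk in upstream_chunks:
        if chunk:
            yield chunk
```

usage would look roughly like `inject_think_false(b'{"model": "some-qwen-tag", "stream": true}')`, which returns the same body with "think": false added. a body that already sets think is passed through with its value unchanged, so a client that does send the param still wins.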
reverted to glm-5 for now. qwen 3.5 itself runs great, ollama just won't let me turn off thinking for small models despite qwen's own docs saying it should default to off.
if you're running qwen 3.5 locally with ollama and have thinking disabled with tool calling intact, what am I missing?
modelfile trick? different runner? different approach entirely?