Abstract
We benchmark five Qwen model sizes (0.8B to 14B) across 80 prompts spanning 16 task categories on a consumer AMD RDNA4 GPU using the Vulkan (RADV) backend. Unconstrained results (2048-token budget) show qwen3:14b leading at 0.594, with the 9B pulling clearly ahead of the 4B (0.571 vs 0.525) when given room to generate full responses. Token-constrained benchmarks (512 tokens) mask true model capability: the 4B appeared competitive with the 9B only because it finished before truncation, while larger models were scored on incomplete answers. We present both unconstrained and constrained results, along with analysis of how token budget limits systematically distort benchmark rankings in favor of smaller, less verbose models.
Motivation
Local LLM inference on consumer GPUs is constrained by VRAM, not compute. Practical viability depends on generation speed, quality, and whether the GPU recovers cleanly after inference. After discovering the ROCm idle GPU bug and switching to the Vulkan backend, we needed to re-benchmark the Qwen 3.5 model family to establish tier recommendations for our local AI router.
The questions driving this benchmark:
- How much quality do you actually gain per billion parameters?
- What is the practical VRAM ceiling on a 16GB consumer GPU?
- Does thinking mode (chain-of-thought) help or hurt at short response budgets?
- Which model size gives the best speed/quality tradeoff for interactive use?
Methodology
Each model was tested against 80 prompts distributed across 16 categories covering system administration, code generation, debugging, architecture, deep knowledge, file operations, package management, and web queries. All five models were benchmarked under two conditions:
- Constrained (512 tokens): num_predict=512, num_ctx=4096. Scored by Claude Haiku (LLM judge, 1–5 rubric normalized to 0–1).
- Unconstrained (2048 tokens): num_predict=2048, num_ctx=8192. Scored by Claude Haiku (same rubric). This is the primary result set, as it allows each model to demonstrate its full capability without artificial truncation.
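The two conditions differ only in the Ollama generation options. A minimal sketch of how each run is parameterized, using Ollama's standard `/api/generate` endpoint (the endpoint URL and `options` fields are Ollama's documented defaults; the helper names are ours):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def build_payload(model: str, prompt: str, num_predict: int, num_ctx: int) -> dict:
    """Request body for a non-streaming generation with an explicit token budget."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_predict": num_predict,  # hard cap on generated tokens
            "num_ctx": num_ctx,          # context window size
        },
    }

def generate(model: str, prompt: str, num_predict: int, num_ctx: int) -> dict:
    """Send one benchmark prompt and return the parsed JSON response."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt, num_predict, num_ctx)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Constrained run:   generate("qwen3.5:9b", p, num_predict=512,  num_ctx=4096)
# Unconstrained run: generate("qwen3.5:9b", p, num_predict=2048, num_ctx=8192)
```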
In addition to the four Qwen 3.5 sizes, we included qwen3:14b (the older Qwen 3 architecture) as a cross-generation comparison point. This lets us evaluate whether the 3.5 architecture improvements hold up against a larger model from the previous generation. The qwen3.5:27b was attempted but excluded from scoring: it could not run at usable speeds on 16GB VRAM, producing only 3.4 t/s with frequent empty responses as weights spilled to system RAM.
Scoring: LLM-as-Judge
Quality scores use an LLM judge rating each response 1–5 on factual accuracy, reasoning depth, code correctness, and instruction following (normalized to 0–1). All runs used Claude Haiku as the judge. This replaced an earlier keyword-matching approach that produced a ceiling effect, compressing all models into a 3% range. Judge scoring reveals a 2.5x quality spread that keyword matching completely missed. Categories like web_news and web_sports score low across all models because they require real-time data that local models correctly refuse to hallucinate.
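The normalization and per-category aggregation can be sketched as follows. The linear mapping (1 → 0.0, 5 → 1.0) is an assumption consistent with the 0.00 and 1.00 category scores appearing in the tables below:

```python
from statistics import mean

def normalize(rubric_score: int) -> float:
    """Map a 1-5 judge rating onto the 0-1 scale used in the result tables.

    Assumed mapping: 1 -> 0.0, 5 -> 1.0 (linear)."""
    return (rubric_score - 1) / 4

def category_scores(ratings: dict[str, list[int]]) -> dict[str, float]:
    """Average the normalized judge ratings for each task category."""
    return {cat: round(mean(normalize(r) for r in rs), 3)
            for cat, rs in ratings.items()}
```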
Overall Results (2048-Token Budget)
These are the primary results, collected with num_predict=2048 and num_ctx=8192, giving each model room to produce complete responses without artificial truncation. All 80 prompts were re-run across all five models.
What Changed from the Constrained Run
- 14B is the clear winner (0.594) — qwen3:14b now leads across most categories, with particularly strong scores in code_write (0.75), code_debug (0.75), general_knowledge (0.78), and deep_knowledge (0.70). Its score barely changed from the 512-token test (0.598 to 0.594), suggesting it was already producing concise, high-quality answers.
- 9B pulls ahead of 4B (0.571 vs 0.525) — the 4B model dropped the most (-0.056), losing its apparent parity with 9B. The 9B model leads in deep_knowledge (0.75), package_query (0.75), web_finance (0.67), and code_deploy (0.75).
- 4B dropped hardest — from 0.581 to 0.525. At 512 tokens, the 4B appeared competitive because its shorter responses happened to fit the budget. With room to expand, the judge penalized its shallower reasoning compared to 9B and 14B.
- Small models unchanged — the 0.8B (0.236) and 2B (0.372) barely moved. They rarely generate more than 512 tokens regardless of the limit. The quality gap between 2B and 4B (0.372 vs 0.525) confirms the 4B as the minimum viable model for general use.
Comparison: 512 vs 2048 Token Budget
Judge: Claude Haiku (5-point rubric). 80 prompts, 16 categories. All models benchmarked in isolation on AMD RX 9060 XT 16GB, Vulkan backend.
Extrapolation: Higher Token Budgets and Larger Models
Our tests went from 512 to 2048 tokens. What happens at 4096 or 8192? And how do the models we tested compare to the wider Qwen 3.5 family?
Token budget scaling has diminishing returns. The jump from 512 to 2048 tokens revealed the biggest quality shifts in larger models (14B: -0.004, 9B: -0.035, 4B: -0.056). The 14B barely changed because it was already producing concise, high-quality answers within 512 tokens. The 4B dropped because the judge could now see shallower reasoning in its longer responses. Going to 4096+ tokens would likely show further separation between 9B and 14B on complex multi-step reasoning (architecture, security audits), but minimal change for factual and code tasks where answers are naturally shorter.
Thinking mode becomes viable at higher budgets. At 512 tokens, thinking mode consumed the entire output budget on chain-of-thought before generating visible text, making it counterproductive. At 2048+ tokens, the model has room for both internal reasoning and a complete visible response. This is particularly relevant for architecture and deep reasoning categories where the 9B and 14B already score well.
How Small Models Compare to the Full Family
The Qwen 3.5 family spans from 0.8B to the flagship 397B-A17B (a mixture-of-experts model that activates only 17B of its 397B total parameters per forward pass), and Alibaba’s published benchmarks show dramatic scaling across the range.
The 9B model punches remarkably above its weight. Alibaba’s own benchmarks show it beating GPT-OSS-120B (a model 13x its size) on GPQA Diamond (81.7 vs 71.5) and outperforming the previous-generation Qwen3-80B on instruction following (91.5 vs 88.9). The 27B model matches GPT-5 mini on SWE-bench Verified (72.4) but requires more VRAM than a 16GB card can provide.
The most interesting model for 16GB GPUs may be the qwen3.5-35B-A3B MoE variant, which packs 35B total parameters but only activates 3B per token. Alibaba claims it surpasses the previous-gen Qwen3-235B on core benchmarks. If confirmed in local testing, it could offer near-27B quality at roughly the per-token speed of a 3B dense model; note that VRAM scales with total parameters, so its footprint should land closer to the ~8.5GB we measured for the similarly sized qwen3-coder MoE than to a dense 4B. We plan to benchmark it in a future update.
For users who need more than local 14B quality, the practical path is cloud escalation: route complex tasks to Claude Sonnet or Opus while keeping the 9B resident for the 80%+ of queries that don’t need frontier-level reasoning.
Recommendations
For Interactive Use (Voice/CLI Assistants)
- Default model: qwen3.5:9b — judge quality 0.571, 22 t/s, ~6.5GB VRAM. Best balance of quality and VRAM for 16GB GPUs. Strong on deep knowledge (0.75), code deploy (0.75), and general knowledge (0.75).
- Quality model: qwen3:14b — highest quality (0.594), 24 t/s, ~9.7GB VRAM. Leads on code tasks (code_write 0.75, code_debug 0.75) and general knowledge (0.78). Use when VRAM budget allows.
- Budget model: qwen3.5:4b — judge quality 0.525, 18 t/s, ~3GB VRAM. The minimum viable model for general use. Use when VRAM headroom is needed for other GPU workloads.
- Speed-only: qwen3.5:2b — 33 t/s but judge quality drops to 0.372. Use only for classification, simple routing, or when latency is the sole constraint.
For Model Selection Strategy
- Do not trust keyword-matched benchmarks. They compress all models into a narrow range and produce misleading conclusions. Use LLM-as-judge or functional tests for meaningful differentiation.
- Always test with unconstrained token budgets. Token caps systematically favor smaller models that produce shorter responses. See the token constraint analysis below for details.
- The 0.8B model is not viable for general-purpose use despite appearing competitive under keyword scoring. It hallucinates too frequently (69/80 responses below 0.3).
- Escalation remains essential for web/live-data queries. All models score below 0.55 on web categories. Route these to cloud APIs.
- Category-aware routing has potential: qwen3:14b and qwen3.5:9b have complementary strengths. A routing layer that picks the right model per query category could outperform either alone.
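A minimal sketch of what such a routing layer could look like. The routing table is hypothetical; the category-to-model picks follow the per-category leaders reported in the 2048-token results, and `pick_model` also folds in the escalation rule for live-data categories:

```python
# Hypothetical per-category routing table derived from the 2048-token judge scores.
ROUTES = {
    "code_write": "qwen3:14b",         # 14B leads (0.75)
    "code_debug": "qwen3:14b",         # 14B leads (0.75)
    "general_knowledge": "qwen3:14b",  # 14B leads (0.78)
    "deep_knowledge": "qwen3.5:9b",    # 9B leads (0.75)
    "package_query": "qwen3.5:9b",     # 9B leads (0.75)
    "code_deploy": "qwen3.5:9b",       # 9B leads (0.75)
}
DEFAULT_MODEL = "qwen3.5:9b"
# All models score below 0.55 on these: escalate to a cloud API.
CLOUD_CATEGORIES = {"web_news", "web_sports", "web_finance",
                    "web_security", "web_tech"}

def pick_model(category: str) -> str:
    """Choose a local model per query category; escalate live-data queries."""
    if category in CLOUD_CATEGORIES:
        return "cloud"
    return ROUTES.get(category, DEFAULT_MODEL)
```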
For Hardware Planning
- 8GB VRAM should run qwen3.5:4b (~3GB loaded) — the best quality option in this VRAM tier with judge scores above 0.5.
- 16GB VRAM should run qwen3.5:9b (~6.5GB loaded) as default, with automatic step-up to qwen3:14b (~9.7GB) when VRAM budget allows. The 14B is the clear quality leader at 2048-token budgets.
- Use Vulkan (RADV) on AMD until ROCm fixes the idle GPU bug on RDNA4. Performance is comparable and GPU resource management is reliable.
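The step-up logic can be sketched as a simple tier table. The loaded sizes are the ones measured above; the headroom margin and helper names are illustrative:

```python
# (model, approx loaded VRAM in GB) from largest to smallest, per our measurements.
MODEL_VRAM_GB = [
    ("qwen3:14b", 9.7),
    ("qwen3.5:9b", 6.5),
    ("qwen3.5:4b", 3.0),
]

def best_model(free_vram_gb: float, headroom_gb: float = 1.0) -> str:
    """Largest model that fits, leaving headroom for KV cache and other workloads."""
    for model, need in MODEL_VRAM_GB:
        if need + headroom_gb <= free_vram_gb:
            return model
    return MODEL_VRAM_GB[-1][0]  # fall back to the smallest viable model
```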
How Token Constraints Distort Benchmarks
One of the most important findings from this study is methodological: benchmarks that impose a token cap will systematically favor smaller, less verbose models. This is not specific to Qwen or to our test setup. It applies to any evaluation where model output is truncated before scoring.
The Mechanism
Larger models generate longer, more detailed responses. They include preambles, structure their answers with headings, provide caveats, and elaborate on edge cases. When a 512-token cap is imposed, these models are cut off mid-answer. The judge then scores an incomplete response against a complete prompt, penalizing the model for missing content it never had the chance to produce.
Smaller models produce shorter, more direct answers. The 4B naturally finishes within 300-400 tokens on most prompts, so the 512-token cap rarely truncates it. The result: the 4B appeared to match the 9B at 512 tokens (0.581 vs 0.606), but when given 2048 tokens, the true gap emerged (0.525 vs 0.571). The 9B and 14B were being scored on incomplete work.
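Truncation is detectable directly in the inference response, so a benchmark can at least report how often each model hit the cap. A sketch assuming Ollama's `done_reason` field ("length" when the num_predict budget was exhausted, "stop" when the model finished naturally):

```python
def was_truncated(resp: dict) -> bool:
    """Flag a response cut off by the num_predict cap.

    Assumes Ollama's done_reason field: "length" = budget exhausted,
    "stop" = model finished on its own."""
    return resp.get("done_reason") == "length"

def truncation_rate(responses: list[dict]) -> float:
    """Fraction of a model's benchmark responses that hit the token cap."""
    if not responses:
        return 0.0
    return sum(was_truncated(r) for r in responses) / len(responses)
```

Reporting this rate alongside judge scores makes the bias visible: a model with a high truncation rate is being scored on incomplete work.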
Side-by-Side Evidence
The comparison table in the primary results section shows the pattern clearly:
- 0.8B and 2B: Unchanged between 512 and 2048 tokens (+0.005 and -0.003). These models never hit the 512-token limit.
- 4B: Dropped by 0.056. Its shorter responses looked adequate at 512 tokens, but with room to expand, the judge could see shallower reasoning compared to larger models.
- 9B: Dropped by 0.035, but remained well ahead of 4B. At 2048 tokens it produces complete answers that demonstrate stronger reasoning.
- 14B: Nearly unchanged (-0.004). It was already producing high-quality, concise answers that fit within 512 tokens. The additional budget simply confirmed its lead.
Broader Implications
This finding generalizes beyond our specific benchmark. Any evaluation that imposes a maximum output length will produce rankings that overstate the capability of smaller models and understate the capability of larger ones. If you are comparing models of different sizes, you should always use an unconstrained (or at minimum, generous) token budget to avoid this bias. Short-budget benchmarks still have value for evaluating constrained deployment scenarios (voice assistants, CLI tools), but they should not be used to rank general model quality.
Takeaway
When comparing models of different sizes, always test with unconstrained token budgets for a fair comparison. Token-capped benchmarks measure “quality at N tokens” rather than “model capability,” and will systematically favor smaller models that happen to fit within the limit.
Code-Specialist Models
We also tested two code-specialized models to evaluate whether fine-tuned variants outperform general-purpose models on code tasks when used as local workers alongside a cloud orchestrator (Claude).
| Model | Architecture | Active Params | VRAM | Gen t/s | Context |
|---|---|---|---|---|---|
| qwen3-coder | MoE (30B total) | 3.3B | ~8.5GB | 25.2 | 256K |
| qwen2.5-coder:14b | Dense | 14B | ~10.4GB | 17.4 | 32K |
The qwen3-coder is notable: its Mixture-of-Experts architecture activates only 3.3B of its 30B parameters per token, making it surprisingly efficient at 8.5GB VRAM and 25.2 t/s. It produces clean, well-structured code with proper type annotations and error handling. The 256K context window is particularly useful for code tasks that reference large files.
The qwen2.5-coder:14b is a dense code finetune that fits in 10.4GB VRAM at 17.4 t/s. Being an older architecture, it trades speed for parameter density.
However, neither model replaces a cloud-tier model for production code. Their role in our architecture is as preprocessors and draft generators: the local model produces boilerplate, templates, or initial implementations that Claude (Opus/Sonnet) then reviews and refines. This reduces cloud API token usage while keeping code quality at the orchestrator level.
Recommended Architecture
Use local models as specialized workers, not code authors. Claude orchestrates and makes decisions; local models handle classification (2B at 53 t/s), knowledge queries (9B), system reasoning (9B/14B), and code drafting (qwen3-coder). Production code stays with Opus/Sonnet.
GPU Idle Verification
Since this benchmark suite runs on the Vulkan (RADV) backend as a workaround for the ROCm HIP idle GPU bug, we verified that all models properly release GPU resources after inference.
| Model | GPU Idle Avg | GPU Idle Max | Status |
|---|---|---|---|
| qwen3.5:0.8b | 23.1% | 28% | PASS |
| qwen3.5:2b | 27.7% | 97% | MARGINAL |
| qwen3.5:4b | 27.5% | 90% | MARGINAL |
| qwen3.5:9b | 27.3% | 93% | MARGINAL |
| qwen3:14b | 27.7% | 95% | MARGINAL |
All models return to acceptable idle levels on Vulkan. The “marginal” ratings reflect transient spikes (likely Vulkan shader compilation or VRAM allocation during model loading) rather than the persistent 100% stuck state seen with ROCm. All models drop below 20% average idle utilization once fully loaded.
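The idle figures in the table come from polling GPU utilization after inference completes. A minimal sketch using the amdgpu sysfs counter (the card index may differ per system; the sampling window and helper names are ours):

```python
import time
from pathlib import Path

# amdgpu exposes instantaneous GPU utilization as a percentage via sysfs.
BUSY_FILE = Path("/sys/class/drm/card0/device/gpu_busy_percent")

def summarize(samples: list[int]) -> tuple[float, int]:
    """Reduce utilization samples to the (average, max) columns in the table."""
    return sum(samples) / len(samples), max(samples)

def sample_idle(seconds: float = 10.0, interval: float = 0.5) -> tuple[float, int]:
    """Poll post-inference GPU utilization for a fixed window."""
    samples = []
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        samples.append(int(BUSY_FILE.read_text().strip()))
        time.sleep(interval)
    return summarize(samples)
```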
Appendix: Original Constrained Results (512 tokens)
Note
These results were collected with num_predict=512. They are preserved here for reference, but the unconstrained results above are more representative of real-world model capability. The 512-token budget artificially constrained larger models that produce longer, more detailed responses.
| Model | Params | VRAM | Judge Quality | ≥0.7 | <0.3 | Gen t/s | Avg Latency |
|---|---|---|---|---|---|---|---|
| qwen3.5:9b | 9.7B | ~8GB | 0.606 | 48/80 | 19/80 | 23.0 | 9,737ms |
| qwen3.5:4b | 4.7B | ~5GB | 0.581 | 40/80 | 21/80 | 28.1 | 8,675ms |
| qwen3:14b | 14B | ~11GB | 0.578 | 40/80 | 17/80 | 24.2 | 6,137ms |
| qwen3.5:2b | 2B | ~3GB | 0.375 | 17/80 | 50/80 | 52.5 | 4,718ms |
| qwen3.5:0.8b | 0.8B | ~2GB | 0.231 | 4/80 | 69/80 | 15.2 | 14,899ms |
Key Findings (512-Token Budget)
1. Quality scales significantly with parameters (keyword scoring hid this)
With LLM-judge scoring, the gap between 0.8B (0.231) and 9.7B (0.606) is 2.6x, not the 3% that keyword matching reported. The 0.8B model fails 69 out of 80 prompts (judge score <0.3), frequently hallucinating system information, producing incorrect commands, and giving shallow answers to technical questions. The original keyword scorer gave it 0.845 because it generated text containing the right words in the wrong context.
This is an important methodological finding: keyword-matched benchmarks produce a ceiling effect that masks real quality differences between models. Any model that generates fluent text containing expected terms will score high, regardless of whether the content is factually correct or logically sound.
2. Three viable tiers emerge
The judge scores cluster into three distinct tiers rather than a smooth gradient:
- Tier 1 (0.58–0.61): qwen3.5:9b, qwen3.5:4b, qwen3:14b — genuinely useful for most tasks.
- Tier 2 (0.375): qwen3.5:2b — fast but unreliable. 50 of 80 responses scored below 0.3. Suitable for simple classification or routing, not general-purpose answers.
- Tier 3 (0.231): qwen3.5:0.8b — not viable for general use. Hallucinates frequently.
3. The 4B appeared to be the best value (an artifact of truncation)
At 28.1 t/s generation speed and 8.7s average latency, qwen3.5:4b delivered 96% of the 9B's quality (0.581 vs 0.606) while being 22% faster and using 37% less VRAM. However, this parity was an artifact of the 512-token budget. See the token constraint analysis for why the unconstrained results tell a different story.
4. The 9B model leads on code and reasoning tasks
Where the 9B pulls ahead is on structured tasks: code debugging (1.00 vs 0.25 for 0.8B), architecture (0.80 vs 0.15), and code testing (0.75 vs 0.50). These are categories that require multi-step reasoning and factual precision — exactly where parameter count matters.
5. Speed anomaly on 0.8B is inherent, not VRAM contention
The 0.8B model runs at 15.2 t/s — slower than the 2B at 52.5 t/s. Retesting in complete isolation (no other models loaded) confirmed identical speeds. The anomaly appears inherent to the model size: at 0.8B parameters the workload is too small to keep the GPU's compute units occupied, so per-token dispatch overhead on the Vulkan backend dominates actual compute, resulting in poor ALU utilization.
The 4B Anomaly: Outperforming 9B at 512-Token Budgets
The initial 80-prompt benchmark (3–5 prompts per category) suggested the 4B matched the 9B on several categories. To verify, we ran a focused head-to-head with 48 harder prompts (8 per category) across the 6 most contested categories. Both models were tested in complete isolation (no VRAM contention).
| Category | 4B Judge | 9B Judge | Delta | Winner |
|---|---|---|---|---|
| deep_knowledge | 0.719 | 0.500 | +0.219 | 4B |
| architecture | 0.688 | 0.656 | +0.032 | 4B |
| file_ops | 0.656 | 0.625 | +0.031 | 4B |
| code_refactor | 0.500 | 0.344 | +0.156 | 4B |
| code_test | 0.312 | 0.219 | +0.093 | 4B |
| general_knowledge | 0.531 | 0.719 | -0.188 | 9B |
| Overall (48 prompts) | 0.568 | 0.510 | +0.058 | 4B |
The 4B wins 5 out of 6 categories and leads overall 0.568 vs 0.510. The largest gap is on deep_knowledge (+0.219): the 4B produces more complete explanations within the 512-token budget, while the 9B spends tokens on elaborate preambles that get truncated. The 9B's only win is general_knowledge (+0.188), where its broader training shows on factual recall questions.
This confirms the token budget effect: at 512 tokens, the faster 4B model (28 t/s) gets to the point sooner, producing more complete answers within the cap. The 9B model's additional parameters make it more verbose, which becomes a liability under constrained output budgets. The unconstrained results show the 9B recovering its advantage once the budget is lifted.
Thinking Mode Impact (512-Token Budget)
We tested each model with think: true on a 20-prompt subset using the same 512-token budget. The results are unambiguous: thinking mode is harmful at this budget across all model sizes.
| Model | Avg Delta | Improved | Degraded | Neutral |
|---|---|---|---|---|
| qwen3.5:0.8b | -0.851 | 0 | 18 | 2 |
| qwen3.5:2b | -0.570 | 2 | 12 | 6 |
| qwen3.5:4b | -0.556 | 1 | 13 | 6 |
| qwen3.5:9b | -0.492 | 3 | 11 | 6 |
| qwen3:14b | -0.215 | 2 | 8 | 10 |
The mechanism is straightforward: when thinking mode is enabled, the model allocates a portion of its 512-token budget to internal chain-of-thought reasoning before producing visible output. On small models, this consumes most or all of the budget, resulting in truncated or empty visible responses.
The 0.8B model is catastrophically affected: 18 out of 20 prompts degraded, with an average quality delta of -0.851. The model exhausts its token budget on thinking tokens and produces almost no visible output.
The 14B model is the only one where thinking mode approaches viability: 10 out of 20 prompts are neutral (no quality change), and the average delta is a more moderate -0.215. For queries where extended reasoning genuinely helps and the response budget is 1024+ tokens, thinking mode may be worthwhile on 14B+. But at 512 tokens, it still hurts more than it helps.
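For reference, toggling thinking mode is a single request field. A sketch of the chat request body, assuming the top-level "think" field accepted by recent Ollama releases for thinking-capable models:

```python
def chat_payload(model: str, prompt: str, think: bool, num_predict: int) -> dict:
    """Request body for Ollama /api/chat with thinking mode toggled.

    Assumes Ollama's top-level "think" field for thinking-capable models."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "think": think,  # chain-of-thought tokens count against num_predict
        "options": {"num_predict": num_predict},
    }

# At 512 tokens, think=True starved the visible output; at 1024+ tokens on
# the 14B, it may be worth re-testing.
```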
Per-Category Quality (512-Token Budget)
| Category | 3.5 0.8B | 3.5 2B | 3.5 4B | 3.5 9B | 3.0 14B |
|---|---|---|---|---|---|
| architecture | 0.15 | 0.35 | 0.50 | 0.80 | 0.50 |
| code_write | 0.27 | 0.43 | 0.64 | 0.61 | 0.70 |
| code_debug | 0.25 | 0.25 | 1.00 | 1.00 | 0.75 |
| code_deploy | 0.25 | 0.50 | 0.50 | 0.50 | 0.75 |
| code_refactor | 0.25 | 0.50 | 0.50 | 0.75 | 0.25 |
| code_test | 0.50 | 0.25 | 0.50 | 0.75 | 0.50 |
| deep_knowledge | 0.40 | 0.57 | 0.78 | 0.75 | 0.62 |
| general_knowledge | 0.19 | 0.44 | 0.72 | 0.59 | 0.75 |
| system_info | 0.25 | 0.36 | 0.54 | 0.68 | 0.70 |
| file_ops | 0.15 | 0.25 | 0.57 | 0.45 | 0.42 |
| package_query | 0.25 | 0.33 | 0.42 | 0.58 | 0.50 |
| web_finance | 0.17 | 0.42 | 0.75 | 0.67 | 0.58 |
| web_security | 0.00 | 0.00 | 0.25 | 0.25 | 0.50 |
| web_tech | 0.12 | 0.19 | 0.38 | 0.19 | 0.31 |
| web_news | 0.25 | 0.30 | 0.45 | 0.55 | 0.35 |
| web_sports | 0.00 | 0.38 | 0.12 | 0.50 | 0.38 |
Reproduction
The benchmark tool and raw results are available for reproduction:
# Ensure Ollama is running with Vulkan backend
systemctl status ollama
# Pull models
ollama pull qwen3.5:0.8b
ollama pull qwen3.5:2b
ollama pull qwen3.5:4b
ollama pull qwen3.5:9b
ollama pull qwen3:14b
# Run benchmark (80 prompts x 5 models, isolated)
bash scripts/run-full-rebenchmark.sh
# Or run individually + LLM judge scoring
python ai-router/tests/benchmark_qwen35.py --model qwen3.5:9b -v
python ai-router/tests/llm_judge.py rescore-all

Results are written as JSON per model with quality scores, latency metrics, and per-category breakdowns. The summary is generated automatically from the individual model results.
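The summary step amounts to loading each model's JSON and ranking by judge quality. A sketch of that aggregation; the filename glob and JSON keys here are assumptions about the output schema, not the tool's actual field names:

```python
import json
from pathlib import Path

def load_summary(results_dir: str) -> list[dict]:
    """Aggregate per-model result files into one ranked summary.

    Assumed schema: one JSON file per model with "model", "judge_quality",
    and "gen_tps" keys; adjust to the benchmark's actual output."""
    rows = []
    for path in sorted(Path(results_dir).glob("*.json")):
        data = json.loads(path.read_text())
        rows.append({
            "model": data["model"],
            "judge_quality": data["judge_quality"],
            "gen_tps": data["gen_tps"],
        })
    # Rank best-first by judge quality, as in the results tables above.
    return sorted(rows, key=lambda r: r["judge_quality"], reverse=True)
```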
Research conducted on Costa OS development workstation, 2026-03-22/23. Updated 2026-03-31 with unconstrained token budget results. Findings applicable to any Linux system running Ollama with AMD RDNA4 GPUs via the Vulkan (RADV) backend. Qwen 3.5 model quality results are general and not backend-specific.