Abstract
When loading Qwen 3 or Qwen 3.5 models via Ollama using the ROCm/HIP backend on AMD RDNA4 GPUs (gfx1200), the GPU becomes permanently stuck at 100% utilization and maximum clock speed, even after inference completes and the model is unloaded. We trace the root cause to HSA runtime teardown in the ROCm stack, benchmark the Vulkan (RADV) backend as a drop-in replacement, and find comparable inference performance with correct GPU idle behavior.
The Bug
When loading Qwen 3 or Qwen 3.5 models via Ollama using the ROCm/HIP backend on AMD RDNA4 GPUs (gfx1200), the GPU becomes permanently stuck at 100% utilization and maximum clock speed, even after:
- The model finishes inference
- The model is explicitly unloaded (`keep_alive: 0`)
- The Ollama process is stopped (`systemctl stop ollama`)
The GPU remains at max P-state (3.47GHz), drawing 66W+ at idle, until the system is rebooted or the amdgpu kernel module is reloaded (which kills the display server).
Symptoms
- `gpu_busy_percent` reads 100% continuously via `/sys/class/drm/card*/device/gpu_busy_percent`
- GPU clock locked at the highest P-state (3.47GHz instead of ~500MHz idle)
- Power draw stays at 66W+ (normal idle is ~15-20W)
- Desktop compositor (Hyprland/Wayland) becomes sluggish
- GPU-accelerated terminals (Ghostty) and shell bars (AGS) crash from starved GPU resources
- The bug persists even after killing all Ollama processes; the HIP runtime leaves the GPU in a stuck state
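The stuck state is easy to confirm from sysfs. A minimal sketch, assuming the standard amdgpu `gpu_busy_percent` interface; the card index varies per system, so the path is taken as a parameter, and the 95% threshold is an arbitrary choice:

```shell
#!/bin/sh
# Report whether the GPU looks wedged, given a gpu_busy_percent sysfs file.
# Threshold and path handling are assumptions; adjust for your card index.
check_gpu_stuck() {
  busy=$(cat "$1" 2>/dev/null || echo 0)
  if [ "$busy" -ge 95 ]; then
    echo "STUCK ($busy%)"
  else
    echo "ok ($busy%)"
  fi
}
```

Usage: `check_gpu_stuck /sys/class/drm/card1/device/gpu_busy_percent`, picking the card index that matches your GPU.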
What Doesn't Work
- `ollama stop` / `keep_alive: 0`: the model unloads but the GPU stays stuck
- `systemctl restart ollama`: Ollama restarts but the GPU never recovers
- AMD GPU reset via sysfs (`/sys/class/drm/card*/device/gpu_reset`): not available on RDNA4
- The `OLLAMA_LLM_LIBRARY=cpu` env var: Ollama ignores it and still loads on the GPU
- Only a full reboot or an `amdgpu` module reload clears the stuck state
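Recovery without a full reboot is possible but drastic, since unloading `amdgpu` takes down the display server. A hedged sketch of the module-reload path, dry-run by default for safety; run it from a TTY or over SSH, and note that anything still holding the GPU (e.g. the compositor) will make the unload fail:

```shell
#!/bin/sh
# Reload the amdgpu module to clear the stuck queue state.
# WARNING: this kills the display server; run from a TTY or over SSH.
# DRY_RUN=1 (the default here, for safety) only prints the commands.
reload_amdgpu() {
  run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else sudo "$@"; fi; }
  run systemctl stop ollama
  run modprobe -r amdgpu   # fails while any process still holds the GPU
  run modprobe amdgpu
}

reload_amdgpu
```

Set `DRY_RUN=0` only once the display server and every other GPU client has been stopped.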
Not Affected
- Qwen 2.5 models (all sizes): work correctly on ROCm, GPU idles normally
- Gemma 3 models: work correctly on ROCm
- Any model on the Vulkan backend: all work correctly
Root Cause Analysis
The bug is in the HSA (Heterogeneous System Architecture) runtime teardown within the ROCm HIP stack. When Qwen 3/3.5 models run inference:
- The HIP runtime allocates GPU compute queues via the HSA API
- During inference, the GPU operates normally at high utilization
- When inference completes or the model is unloaded, the HIP runtime fails to properly release/destroy the compute queues
- The GPU hardware continues executing the orphaned queue dispatch, maintaining max clock and 100% utilization
- Even after the HIP process exits, the kernel-level HSA queue state persists in the `amdgpu` driver
- The only recovery is resetting the GPU at the driver level (module reload) or rebooting
This appears to be related to ROCm bug #5706 and the relatively new gfx1200 (RDNA4) support in ROCm 7.x. The bug is model-architecture-specific. Qwen3's architecture (likely its attention pattern or KV cache implementation in GGML) triggers a code path in the HIP runtime that doesn't clean up properly on RDNA4.
Why Qwen3 but not Qwen2.5?
The GGML/llama.cpp HIP kernels dispatch differently based on model architecture. Qwen3/3.5 uses a different attention mechanism that may invoke HIP graph capture or persistent kernel dispatch, features that have known teardown issues on newer AMD GPU architectures. The specific interaction between:
- GGML's ROCm kernel dispatch for Qwen3's architecture
- rocBLAS kernel selection for gfx1200
- HSA queue management in ROCm 7.2.0
...creates a scenario where queue cleanup fails silently.
The Fix: Vulkan Backend
Discovery
Ollama supports multiple GPU backends via shared libraries:
- `libggml-hip.so`: ROCm/HIP (AMD)
- `libggml-vulkan.so`: Vulkan (all GPUs)
- `libggml-cuda.so`: CUDA (NVIDIA)
On Arch Linux, these are separate packages: `ollama-rocm`, `ollama-vulkan`, `ollama-cuda`.
The Vulkan backend uses mesa's RADV driver instead of ROCm's HSA/HIP stack. RADV is the open-source Vulkan driver maintained by the Mesa project and has excellent RDNA4 support. It uses a completely different GPU resource management path that properly cleans up compute resources on unload.
Configuration
When both `ollama-rocm` and `ollama-vulkan` are installed, Ollama's backend selection logic hardcodes ROCm > Vulkan: the `PreferredLibrary()` method unconditionally prefers ROCm over Vulkan. To force Vulkan, you must hide the ROCm devices:
In a systemd drop-in (`/etc/systemd/system/ollama.service.d/vulkan.conf`):

```ini
[Service]
Environment="OLLAMA_LLM_LIBRARY=vulkan"
Environment="ROCR_VISIBLE_DEVICES=NOT_A_DEVICE"
Environment="HIP_VISIBLE_DEVICES=-1"
```

Then reload:

```shell
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

The Ollama log confirms the active backend:

```
library=Vulkan compute=0.0 name=Vulkan0 description="AMD Radeon RX 9060 XT (RADV GFX1200)"
```

Important
OLLAMA_LLM_LIBRARY=vulkan alone is not sufficient. Despite documentation suggesting this env var forces a specific backend, Ollama 0.18.2 still spawns the ROCm runner alongside the Vulkan one and ROCm wins the dedup. You must also set ROCR_VISIBLE_DEVICES=NOT_A_DEVICE and HIP_VISIBLE_DEVICES=-1 to prevent ROCm from discovering the GPU at all.
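One way to verify the override actually took effect is to check which backend the runner logged. A small sketch that pulls the backend name out of a device log line in the format shown above (where your Ollama logs land is an assumption; `journalctl -u ollama` on a typical systemd setup):

```shell
#!/bin/sh
# Extract the backend name from an Ollama device log line ("library=Vulkan ...").
backend_from_log() {
  echo "$1" | sed -n 's/.*library=\([A-Za-z0-9]*\).*/\1/p'
}

backend_from_log 'library=Vulkan compute=0.0 name=Vulkan0'   # → Vulkan
```

For example: `backend_from_log "$(journalctl -u ollama -b | grep -m1 'library=')"`.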
Benchmark Results
All tests on an AMD RX 9060 XT 16GB, same prompt (“Explain TCP vs UDP in 3 sentences”), `num_predict: 100`, `think: false`.
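Ollama's `/api/generate` response reports token counts and durations (`prompt_eval_count`/`prompt_eval_duration`, `eval_count`/`eval_duration`), with durations in nanoseconds; speeds like those in the table below can be derived from them. A sketch of the conversion:

```shell
#!/bin/sh
# tokens/sec from a token count and a duration in nanoseconds
tokens_per_sec() {
  awk -v c="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", c / (ns / 1e9) }'
}

tokens_per_sec 100 3380000000   # 100 tokens in 3.38s → 29.6
```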
| Model | Prompt Speed | Generation Speed | Total Time |
|---|---|---|---|
| qwen3.5:4b | 210 t/s | 29.6 t/s | 3.8s |
| qwen3.5:9b | 165 t/s | 21.3 t/s | 9.6s |
| qwen2.5:7b | 163 t/s | 26.3 t/s | 7.8s |
| Metric | ROCm (HIP) | Vulkan (RADV) |
|---|---|---|
| GPU % during inference | 100% | ~96% |
| GPU % model loaded, idle | 100% (stuck) | 15% |
| GPU % after unload | 100% (never recovers) | 14-16% (normal) |
| GPU clock after unload | 3.47GHz (max) | 1.65GHz (dropping to idle) |
| Power draw after unload | 66W+ (stuck) | 27-29W (normal idle) |
| Recovery after unload | Requires reboot | Immediate |
Vulkan delivers comparable inference performance to ROCm while completely avoiding the idle GPU bug. The generation speed difference is negligible for the models tested. Prompt processing is actually faster on Vulkan for some models.
Recommendations
For RDNA4 Users
- Use `ollama-vulkan` instead of `ollama-rocm` for all local LLM inference
- If you need ROCm for other workloads (PyTorch training, etc.), run Ollama with the Vulkan backend while keeping ROCm available for other applications
- This bug may be fixed in future ROCm releases. Check ROCm 7.3+ changelogs
For Ollama Developers
- The `OLLAMA_LLM_LIBRARY` env var doesn't fully override backend selection when ROCm is installed; the dedup logic still prefers ROCm
- Consider adding per-model backend selection or a GPU idle monitoring watchdog
- The `PreferredLibrary()` hardcoding of ROCm > Vulkan should be revisited, especially for GPUs where ROCm has known issues
For Linux Distribution Maintainers
- Ship `ollama-vulkan` as the default Ollama GPU package for AMD systems
- `ollama-rocm` should be an optional add-on, not the default
- The Vulkan backend via Mesa RADV is more universally reliable across AMD GPU generations
Reproduction Steps
1. Install `ollama-rocm` on an RDNA4 system (gfx1200)
2. Pull any Qwen 3 or Qwen 3.5 model: `ollama pull qwen3.5:4b`
3. Run inference: `ollama run qwen3.5:4b "hello"`
4. After the response completes, check the GPU: `cat /sys/class/drm/card*/device/gpu_busy_percent`
5. Observe 100% even though no inference is running
6. Unload the model: `curl localhost:11434/api/generate -d '{"model":"qwen3.5:4b","keep_alive":0}'`
7. The GPU remains at 100%
8. Stop Ollama: `sudo systemctl stop ollama`
9. The GPU remains at 100%; only a reboot (or `amdgpu` module reload) fixes it
Files Changed in Costa OS
- `scripts/ollama-manager.sh`: Updated comments for the Vulkan requirement
- `installer/first-boot.sh`: Auto-configures the Vulkan backend on AMD GPU detection
- `packages/ai.txt`: Added `ollama-vulkan` as default, documented the ROCm caveat
- `knowledge/vram-manager.md`: Updated model tiers (qwen3.5), added a GPU backend section
- `CLAUDE.md`: Updated architecture notes for the Vulkan backend
- `/etc/systemd/system/ollama.service.d/vulkan.conf`: Systemd override (runtime config)
Research conducted on Costa OS development workstation, 2026-03-22. Findings applicable to any Linux system running Ollama with AMD RDNA4 GPUs.