
ROCm HIP Idle GPU Bug on RDNA4: Vulkan Workaround

Kevin · 8 min read

Abstract

When loading Qwen 3 or Qwen 3.5 models via Ollama using the ROCm/HIP backend on AMD RDNA4 GPUs (gfx1200), the GPU becomes permanently stuck at 100% utilization and maximum clock speed, even after inference completes and the model is unloaded. We trace the root cause to HSA runtime teardown in the ROCm stack, benchmark the Vulkan (RADV) backend as a drop-in replacement, and find comparable inference performance with correct GPU idle behavior.

Hardware: AMD Radeon RX 9060 XT 16GB (RDNA 4, gfx1200)
Software: Arch Linux, kernel 6.19.8, Ollama 0.18.2, ROCm 7.2.0, mesa (RADV)
Affected Models: qwen3, qwen3.5 (all sizes tested: 4b, 9b, 14b)
Severity: Critical. Renders the GPU unusable until reboot

The Bug

When loading Qwen 3 or Qwen 3.5 models via Ollama using the ROCm/HIP backend on AMD RDNA4 GPUs (gfx1200), the GPU becomes permanently stuck at 100% utilization and maximum clock speed, even after:

  1. The model finishes inference
  2. The model is explicitly unloaded (keep_alive: 0)
  3. The Ollama process is stopped (systemctl stop ollama)

The GPU remains at max P-state (3.47GHz), drawing 66W+ at idle, until the system is rebooted or the amdgpu kernel module is reloaded (which kills the display server).

Symptoms

  • gpu_busy_percent reads 100% continuously via /sys/class/drm/card*/device/gpu_busy_percent
  • GPU clock locked at highest P-state (3.47GHz instead of 500MHz idle)
  • Power draw stays at 66W+ (normal idle is ~15-20W)
  • Desktop compositor (Hyprland/Wayland) becomes sluggish
  • GPU-accelerated terminals (Ghostty) and shell bars (AGS) crash from starved GPU resources
  • The bug persists even after killing all Ollama processes. The HIP runtime leaves the GPU in a stuck state
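
The symptoms above can be spot-checked from a shell. A minimal sketch, assuming the standard amdgpu sysfs layout (the card index, and whether each file exists, varies by system; the 95% threshold in the helper is an arbitrary cutoff for illustration):

```shell
#!/usr/bin/env bash
# Spot-check GPU idle state via amdgpu sysfs (paths are the standard ones,
# but availability varies per kernel/GPU).
classify() {  # label a busy percentage as stuck vs idle (assumed cutoff)
  if [ "$1" -ge 95 ]; then echo "stuck"; else echo "idle"; fi
}
for card in /sys/class/drm/card*/device; do
  [ -r "$card/gpu_busy_percent" ] || continue
  busy=$(cat "$card/gpu_busy_percent")
  echo "$card: ${busy}% busy ($(classify "$busy"))"
  # The active shader clock level is marked with '*'
  grep -H '\*' "$card/pp_dpm_sclk" 2>/dev/null || true
done
```

On an affected machine this keeps printing 100% busy at the top P-state long after Ollama has exited.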

What Doesn't Work

  • ollama stop / keep_alive: 0: model unloads but GPU stays stuck
  • systemctl restart ollama: Ollama restarts but GPU never recovers
  • AMD GPU reset via sysfs (/sys/class/drm/card*/device/gpu_reset): not available on RDNA4
  • OLLAMA_LLM_LIBRARY=cpu env var: Ollama ignores it and still loads on GPU
  • Only a full reboot or amdgpu module reload clears the stuck state
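
For reference, the module-reload recovery can be sketched as a function. This is assumption-laden: the display-manager unit name varies by distro, and as noted above, unloading amdgpu kills the graphical session.

```shell
#!/usr/bin/env bash
# Last-resort recovery without a full reboot. "display-manager" is an
# assumed unit name; the graphical session WILL die during the reload.
recover_gpu() {
  systemctl stop ollama            # ensure nothing still holds /dev/kfd
  systemctl stop display-manager   # Wayland/X session ends here
  modprobe -r amdgpu               # fails if any process still maps the GPU
  modprobe amdgpu
  systemctl start display-manager
}
```

Run it as root from a TTY or SSH session, since the desktop does not survive the module reload.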

Not Affected

  • Qwen 2.5 models (all sizes). Work correctly on ROCm, GPU idles normally
  • Gemma 3 models. Work correctly on ROCm
  • Any model on the Vulkan backend. All work correctly

Root Cause Analysis

The bug is in the HSA (Heterogeneous System Architecture) runtime teardown within the ROCm HIP stack. When Qwen 3/3.5 models run inference:

  1. The HIP runtime allocates GPU compute queues via the HSA API
  2. During inference, the GPU operates normally at high utilization
  3. When inference completes or the model is unloaded, the HIP runtime fails to properly release/destroy the compute queues
  4. The GPU hardware continues executing the orphaned queue dispatch, maintaining max clock and 100% utilization
  5. Even after the HIP process exits, the kernel-level HSA queue state persists in the amdgpu driver
  6. The only recovery is resetting the GPU at the driver level (module reload) or rebooting
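
Step 5 can be observed directly: once Ollama is stopped, no userspace process holds ROCm's /dev/kfd compute node, yet the busy counter stays pinned, which points at kernel-side state. A sketch (fuser availability and the sysfs paths are assumptions about your setup):

```shell
#!/usr/bin/env bash
# After 'systemctl stop ollama': nothing holds the ROCm compute node,
# but gpu_busy_percent still reads 100 on an affected machine.
holders=""
if [ -e /dev/kfd ] && command -v fuser >/dev/null; then
  holders=$(fuser /dev/kfd 2>/dev/null || true)
fi
echo "processes holding /dev/kfd: ${holders:-none}"
cat /sys/class/drm/card*/device/gpu_busy_percent 2>/dev/null || true
```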

This appears to be related to ROCm issue #5706 and the relatively new gfx1200 (RDNA4) support in ROCm 7.x. The bug is model-architecture-specific: Qwen3's architecture (likely its attention pattern or KV cache implementation in GGML) triggers a code path in the HIP runtime that doesn't clean up properly on RDNA4.

Why Qwen3 but not Qwen2.5?

The GGML/llama.cpp HIP kernels dispatch differently based on model architecture. Qwen3/3.5 uses a different attention mechanism that may invoke HIP graph capture or persistent kernel dispatch, features that have known teardown issues on newer AMD GPU architectures. The specific interaction between:

  • GGML's ROCm kernel dispatch for Qwen3's architecture
  • rocBLAS kernel selection for gfx1200
  • HSA queue management in ROCm 7.2.0

...creates a scenario where queue cleanup fails silently.

The Fix: Vulkan Backend

Discovery

Ollama supports multiple GPU backends via shared libraries:

  • libggml-hip.so: ROCm/HIP (AMD)
  • libggml-vulkan.so: Vulkan (all GPUs)
  • libggml-cuda.so: CUDA (NVIDIA)

On Arch Linux, these are separate packages: ollama-rocm, ollama-vulkan, ollama-cuda.

The Vulkan backend uses mesa's RADV driver instead of ROCm's HSA/HIP stack. RADV is the open-source Vulkan driver maintained by the Mesa project and has excellent RDNA4 support. It uses a completely different GPU resource management path that properly cleans up compute resources on unload.
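
Before switching, it is worth confirming that RADV actually exposes the GPU. A sketch using vulkaninfo from the vulkan-tools package (exact field names and output format may differ across versions):

```shell
#!/usr/bin/env bash
# Check that a Vulkan device backed by RADV is visible.
# vulkaninfo ships in vulkan-tools; absent on minimal installs.
summary=$(vulkaninfo --summary 2>/dev/null \
            | grep -iE 'deviceName|driverName' || true)
echo "${summary:-no Vulkan device found}"
```

On the test machine this reports the RX 9060 XT with a RADV driver name, matching the Ollama log line shown below in Configuration.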

Configuration

When both ollama-rocm and ollama-vulkan are installed, Ollama's backend selection logic hardcodes ROCm > Vulkan. The PreferredLibrary() method returns true for ROCm over Vulkan unconditionally. To force Vulkan, you must hide ROCm devices:

/etc/systemd/system/ollama.service.d/vulkan.conf
[Service]
Environment="OLLAMA_LLM_LIBRARY=vulkan"
Environment="ROCR_VISIBLE_DEVICES=NOT_A_DEVICE"
Environment="HIP_VISIBLE_DEVICES=-1"

Then reload:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Ollama log confirms:

library=Vulkan compute=0.0 name=Vulkan0 description="AMD Radeon RX 9060 XT (RADV GFX1200)"

Important

OLLAMA_LLM_LIBRARY=vulkan alone is not sufficient. Despite documentation suggesting this env var forces a specific backend, Ollama 0.18.2 still spawns the ROCm runner alongside the Vulkan one and ROCm wins the dedup. You must also set ROCR_VISIBLE_DEVICES=NOT_A_DEVICE and HIP_VISIBLE_DEVICES=-1 to prevent ROCm from discovering the GPU at all.
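
To double-check which backend actually won after a restart, grep the unit's journal for the library= line quoted above (this assumes Ollama runs as the ollama systemd unit and logs to journald):

```shell
#!/usr/bin/env bash
# Confirm the Vulkan backend is the one in use, via the ollama unit's log.
line=$(journalctl -u ollama --no-pager 2>/dev/null \
         | grep -m1 'library=Vulkan' || true)
echo "${line:-no Vulkan line found in ollama logs}"
```

If the line is missing and a library=ROCm line appears instead, the visibility env vars above are not being applied.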

Benchmark Results

All tests on AMD RX 9060 XT 16GB, same prompt (“Explain TCP vs UDP in 3 sentences”), num_predict: 100, think: false.
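
For context on how the speed columns are derived: Ollama's /api/generate response reports eval_count (tokens generated) and eval_duration (nanoseconds), and tokens/sec is their ratio. A small helper sketch; the commented curl body mirrors the settings above:

```shell
#!/usr/bin/env bash
# tokens, duration_ns -> tokens/sec, one decimal place
tps() {
  awk -v n="$1" -v d="$2" 'BEGIN { printf "%.1f\n", n / (d / 1e9) }'
}
# curl -s localhost:11434/api/generate -d '{"model":"qwen3.5:4b",
#   "prompt":"Explain TCP vs UDP in 3 sentences","stream":false,
#   "think":false,"options":{"num_predict":100}}'
tps 100 3380000000   # e.g. 100 tokens in 3.38 s -> 29.6 t/s
```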

Table 1. Vulkan (RADV) inference performance
Model         Prompt Speed   Generation Speed   Total Time
qwen3.5:4b    210 t/s        29.6 t/s           3.8s
qwen3.5:9b    165 t/s        21.3 t/s           9.6s
qwen2.5:7b    163 t/s        26.3 t/s           7.8s
Table 2. GPU behavior comparison between backends
Metric                     ROCm (HIP)              Vulkan (RADV)
GPU % during inference     100%                    ~96%
GPU % model loaded, idle   100% (stuck)            15%
GPU % after unload         100% (never recovers)   14-16% (normal)
GPU clock after unload     3.47GHz (max)           1.65GHz (dropping to idle)
Power draw after unload    66W+ (stuck)            27-29W (normal idle)
Recovery after unload      Requires reboot         Immediate

Vulkan delivers comparable inference performance to ROCm while completely avoiding the idle GPU bug. The generation speed difference is negligible for the models tested. Prompt processing is actually faster on Vulkan for some models.

Recommendations

For RDNA4 Users

  1. Use ollama-vulkan instead of ollama-rocm for all local LLM inference
  2. If you need ROCm for other workloads (PyTorch training, etc.), run Ollama separately with Vulkan while keeping ROCm available for other applications
  3. This bug may be fixed in future ROCm releases. Check ROCm 7.3+ changelogs

For Ollama Developers

  1. The OLLAMA_LLM_LIBRARY env var doesn't fully override backend selection when ROCm is installed. The dedup logic still prefers ROCm
  2. Consider adding per-model backend selection or a GPU idle monitoring watchdog
  3. The PreferredLibrary() hardcoding of ROCm > Vulkan should be revisited, especially for GPUs where ROCm has known issues

For Linux Distribution Maintainers

  1. Ship ollama-vulkan as the default Ollama GPU package for AMD systems
  2. ollama-rocm should be an optional add-on, not the default
  3. The Vulkan backend via mesa RADV is more universally reliable across AMD GPU generations

Reproduction Steps

  1. Install ollama-rocm on an RDNA4 system (gfx1200)
  2. Pull any Qwen 3 or Qwen 3.5 model: ollama pull qwen3.5:4b
  3. Run inference: ollama run qwen3.5:4b "hello"
  4. After response completes, check GPU: cat /sys/class/drm/card*/device/gpu_busy_percent
  5. Observe 100% even though no inference is running
  6. Unload the model: curl localhost:11434/api/generate -d '{"model":"qwen3.5:4b","keep_alive":0}'
  7. GPU remains at 100%
  8. Stop Ollama: sudo systemctl stop ollama
  9. GPU remains at 100%. Only reboot fixes it
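
The steps above can be collected into a rough script. Treat it as a sketch: the model name, port, and sleep intervals are taken from this article, and it should only be run on a machine you can afford to reboot.

```shell
#!/usr/bin/env bash
# Reproduction sketch: run inference, unload, stop Ollama, and read the
# busy counter after each step. On an affected RDNA4/ROCm setup all three
# readings stay at 100.
busy() { cat /sys/class/drm/card*/device/gpu_busy_percent 2>/dev/null | head -1; }
repro() {
  ollama run qwen3.5:4b "hello" >/dev/null
  sleep 5; echo "after inference: $(busy)%"
  curl -s localhost:11434/api/generate \
       -d '{"model":"qwen3.5:4b","keep_alive":0}' >/dev/null
  sleep 5; echo "after unload:    $(busy)%"
  sudo systemctl stop ollama
  sleep 5; echo "after stop:      $(busy)%"
}
```

Call repro on the affected machine; on the Vulkan backend the last two readings drop back to normal idle.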

Files Changed in Costa OS

  • scripts/ollama-manager.sh: Updated comments for Vulkan requirement
  • installer/first-boot.sh: Auto-configures Vulkan backend on AMD GPU detection
  • packages/ai.txt: Added ollama-vulkan as default, documented ROCm caveat
  • knowledge/vram-manager.md: Updated model tiers (qwen3.5), added GPU backend section
  • CLAUDE.md: Updated architecture to note Vulkan backend
  • /etc/systemd/system/ollama.service.d/vulkan.conf: Systemd override (runtime config)

Research conducted on Costa OS development workstation, 2026-03-22. Findings applicable to any Linux system running Ollama with AMD RDNA4 GPUs.