Abstract
When loading Qwen 3 or Qwen 3.5 models via Ollama using the ROCm/HIP backend on AMD RDNA4 GPUs (gfx1200), the GPU becomes permanently stuck at 100% utilization and maximum clock speed, even after inference completes and the model is unloaded. We trace the root cause to HSA runtime teardown in the ROCm stack, benchmark the Vulkan (RADV) backend as a drop-in replacement, and find comparable inference performance with correct GPU idle behavior.
The Bug
When loading Qwen 3 or Qwen 3.5 models via Ollama using the ROCm/HIP backend on AMD RDNA4 GPUs (gfx1200), the GPU becomes permanently stuck at 100% utilization and maximum clock speed, even after:
- The model finishes inference
- The model is explicitly unloaded (`keep_alive: 0`)
- The Ollama process is stopped (`systemctl stop ollama`)
The GPU remains at max P-state (3.47GHz), drawing 66W+ at idle, until the system is rebooted or the amdgpu kernel module is reloaded (which kills the display server).
Symptoms
- `gpu_busy_percent` reads 100% continuously via `/sys/class/drm/card*/device/gpu_busy_percent`
- GPU clock locked at the highest P-state (3.47GHz instead of ~500MHz idle)
- Power draw stays at 66W+ (normal idle is ~15-20W)
- Desktop compositor (Hyprland/Wayland) becomes sluggish
- GPU-accelerated terminals (Ghostty) and shell bars (AGS) crash from starved GPU resources
- The bug persists even after killing all Ollama processes; the HIP runtime leaves the GPU in a stuck state
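The stuck state is easy to confirm from sysfs. A minimal sketch, assuming the standard amdgpu `gpu_busy_percent` interface; the card index varies per system, so the path is taken as a parameter, and the 95% threshold is an arbitrary choice:

```shell
#!/bin/sh
# Report whether the GPU looks wedged, given a gpu_busy_percent sysfs file.
# Threshold and path handling are assumptions; adjust for your card index.
check_gpu_stuck() {
  busy=$(cat "$1" 2>/dev/null || echo 0)
  if [ "$busy" -ge 95 ]; then
    echo "STUCK ($busy%)"
  else
    echo "ok ($busy%)"
  fi
}
```

Usage: `check_gpu_stuck /sys/class/drm/card1/device/gpu_busy_percent`, picking the card index that matches your GPU.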
What Doesn't Work
- `ollama stop` / `keep_alive: 0`: the model unloads but the GPU stays stuck
- `systemctl restart ollama`: Ollama restarts but the GPU never recovers
- AMD GPU reset via sysfs (`/sys/class/drm/card*/device/gpu_reset`): not available on RDNA4
- The `OLLAMA_LLM_LIBRARY=cpu` env var: Ollama ignores it and still loads on the GPU
- Only a full reboot or an `amdgpu` module reload clears the stuck state
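Recovery without a full reboot is possible but drastic, since unloading `amdgpu` takes down the display server. A hedged sketch of the module-reload path, dry-run by default for safety; run it from a TTY or over SSH, and note that anything still holding the GPU (e.g. the compositor) will make the unload fail:

```shell
#!/bin/sh
# Reload the amdgpu module to clear the stuck queue state.
# WARNING: this kills the display server; run from a TTY or over SSH.
# DRY_RUN=1 (the default here, for safety) only prints the commands.
reload_amdgpu() {
  run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else sudo "$@"; fi; }
  run systemctl stop ollama
  run modprobe -r amdgpu   # fails while any process still holds the GPU
  run modprobe amdgpu
}

reload_amdgpu
```

Set `DRY_RUN=0` only once the display server and every other GPU client has been stopped.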
Not Affected
- Qwen 2.5 models (all sizes): work correctly on ROCm, GPU idles normally
- Gemma 3 models: work correctly on ROCm
- Any model on the Vulkan backend: all work correctly
Root Cause Analysis
The bug is in the HSA (Heterogeneous System Architecture) runtime teardown within the ROCm HIP stack. When Qwen 3/3.5 models run inference:
- The HIP runtime allocates GPU compute queues via the HSA API
- During inference, the GPU operates normally at high utilization
- When inference completes or the model is unloaded, the HIP runtime fails to properly release/destroy the compute queues
- The GPU hardware continues executing the orphaned queue dispatch, maintaining max clock and 100% utilization
- Even after the HIP process exits, the kernel-level HSA queue state persists in the `amdgpu` driver
- The only recovery is resetting the GPU at the driver level (module reload) or rebooting
This appears to be related to ROCm bug #5706 and the relatively new gfx1200 (RDNA4) support in ROCm 7.x. The bug is model-architecture-specific. Qwen3's architecture (likely its attention pattern or KV cache implementation in GGML) triggers a code path in the HIP runtime that doesn't clean up properly on RDNA4.
Why Qwen3 but not Qwen2.5?
The GGML/llama.cpp HIP kernels dispatch differently based on model architecture. Qwen3/3.5 uses a different attention mechanism that may invoke HIP graph capture or persistent kernel dispatch, features that have known teardown issues on newer AMD GPU architectures. The specific interaction between:
- GGML's ROCm kernel dispatch for Qwen3's architecture
- rocBLAS kernel selection for gfx1200
- HSA queue management in ROCm 7.2.0
...creates a scenario where queue cleanup fails silently.
The Fix: Vulkan Backend
Discovery
Ollama supports multiple GPU backends via shared libraries:
- `libggml-hip.so`: ROCm/HIP (AMD)
- `libggml-vulkan.so`: Vulkan (all GPUs)
- `libggml-cuda.so`: CUDA (NVIDIA)
On Arch Linux, these are separate packages: `ollama-rocm`, `ollama-vulkan`, `ollama-cuda`.
The Vulkan backend uses mesa's RADV driver instead of ROCm's HSA/HIP stack. RADV is the open-source Vulkan driver maintained by the Mesa project and has excellent RDNA4 support. It uses a completely different GPU resource management path that properly cleans up compute resources on unload.
Configuration
When both `ollama-rocm` and `ollama-vulkan` are installed, Ollama's backend selection logic hardcodes ROCm > Vulkan: the `PreferredLibrary()` method unconditionally prefers ROCm over Vulkan. To force Vulkan, you must hide the ROCm devices:
In a systemd drop-in (`/etc/systemd/system/ollama.service.d/vulkan.conf`):

```ini
[Service]
Environment="OLLAMA_LLM_LIBRARY=vulkan"
Environment="ROCR_VISIBLE_DEVICES=NOT_A_DEVICE"
Environment="HIP_VISIBLE_DEVICES=-1"
```

Then reload:

```shell
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

The Ollama log confirms the active backend:

```
library=Vulkan compute=0.0 name=Vulkan0 description="AMD Radeon RX 9060 XT (RADV GFX1200)"
```

Important
OLLAMA_LLM_LIBRARY=vulkan alone is not sufficient. Despite documentation suggesting this env var forces a specific backend, Ollama 0.18.2 still spawns the ROCm runner alongside the Vulkan one and ROCm wins the dedup. You must also set ROCR_VISIBLE_DEVICES=NOT_A_DEVICE and HIP_VISIBLE_DEVICES=-1 to prevent ROCm from discovering the GPU at all.
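One way to verify the override actually took effect is to check which backend the runner logged. A small sketch that pulls the backend name out of a device log line in the format shown above (where your Ollama logs land is an assumption; `journalctl -u ollama` on a typical systemd setup):

```shell
#!/bin/sh
# Extract the backend name from an Ollama device log line ("library=Vulkan ...").
backend_from_log() {
  echo "$1" | sed -n 's/.*library=\([A-Za-z0-9]*\).*/\1/p'
}

backend_from_log 'library=Vulkan compute=0.0 name=Vulkan0'   # → Vulkan
```

For example: `backend_from_log "$(journalctl -u ollama -b | grep -m1 'library=')"`.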
Benchmark Results
All tests on an AMD RX 9060 XT 16GB, same prompt (“Explain TCP vs UDP in 3 sentences”), `num_predict: 100`, `think: false`.
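Ollama's `/api/generate` response reports token counts and durations (`prompt_eval_count`/`prompt_eval_duration`, `eval_count`/`eval_duration`), with durations in nanoseconds; speeds like those in the table below can be derived from them. A sketch of the conversion:

```shell
#!/bin/sh
# tokens/sec from a token count and a duration in nanoseconds
tokens_per_sec() {
  awk -v c="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", c / (ns / 1e9) }'
}

tokens_per_sec 100 3380000000   # 100 tokens in 3.38s → 29.6
```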
| Model | Prompt Speed | Generation Speed | Total Time |
|---|---|---|---|
| qwen3.5:4b | 210 t/s | 29.6 t/s | 3.8s |
| qwen3.5:9b | 165 t/s | 21.3 t/s | 9.6s |
| qwen2.5:7b | 163 t/s | 26.3 t/s | 7.8s |
| Metric | ROCm (HIP) | Vulkan (RADV) |
|---|---|---|
| GPU % during inference | 100% | ~96% |
| GPU % model loaded, idle | 100% (stuck) | 15% |
| GPU % after unload | 100% (never recovers) | 14-16% (normal) |
| GPU clock after unload | 3.47GHz (max) | 1.65GHz (dropping to idle) |
| Power draw after unload | 66W+ (stuck) | 27-29W (normal idle) |
| Recovery after unload | Requires reboot | Immediate |
Vulkan delivers comparable inference performance to ROCm while completely avoiding the idle GPU bug. The generation speed difference is negligible for the models tested. Prompt processing is actually faster on Vulkan for some models.
Recommendations
For RDNA4 Users
- Use `ollama-vulkan` instead of `ollama-rocm` for all local LLM inference
- If you need ROCm for other workloads (PyTorch training, etc.), run Ollama with the Vulkan backend while keeping ROCm available for other applications
- This bug may be fixed in future ROCm releases. Check ROCm 7.3+ changelogs
For Ollama Developers
- The `OLLAMA_LLM_LIBRARY` env var doesn't fully override backend selection when ROCm is installed; the dedup logic still prefers ROCm
- Consider adding per-model backend selection or a GPU idle monitoring watchdog
- The `PreferredLibrary()` hardcoding of ROCm > Vulkan should be revisited, especially for GPUs where ROCm has known issues
For Linux Distribution Maintainers
- Ship `ollama-vulkan` as the default Ollama GPU package for AMD systems
- `ollama-rocm` should be an optional add-on, not the default
- The Vulkan backend via Mesa RADV is more universally reliable across AMD GPU generations
Reproduction Steps
1. Install `ollama-rocm` on an RDNA4 system (gfx1200)
2. Pull any Qwen 3 or Qwen 3.5 model: `ollama pull qwen3.5:4b`
3. Run inference: `ollama run qwen3.5:4b "hello"`
4. After the response completes, check the GPU: `cat /sys/class/drm/card*/device/gpu_busy_percent`
5. Observe 100% even though no inference is running
6. Unload the model: `curl localhost:11434/api/generate -d '{"model":"qwen3.5:4b","keep_alive":0}'`
7. The GPU remains at 100%
8. Stop Ollama: `sudo systemctl stop ollama`
9. The GPU remains at 100%; only a reboot (or `amdgpu` module reload) fixes it
Files Changed in Costa OS
- `scripts/ollama-manager.sh`: Updated comments for the Vulkan requirement
- `installer/first-boot.sh`: Auto-configures the Vulkan backend on AMD GPU detection
- `packages/ai.txt`: Added `ollama-vulkan` as default, documented the ROCm caveat
- `knowledge/vram-manager.md`: Updated model tiers (qwen3.5), added a GPU backend section
- `CLAUDE.md`: Updated architecture notes for the Vulkan backend
- `/etc/systemd/system/ollama.service.d/vulkan.conf`: Systemd override (runtime config)
Research conducted on Costa OS development workstation, 2026-03-22. Findings applicable to any Linux system running Ollama with AMD RDNA4 GPUs.