PyTorch on ROCm in Docker in QEMU fails

Imred_Gemu · November 15, 2023, 12:20am

I know the obvious answer to this is probably that there are just too many layers of jank, and it’s even worse than the title. I’m running PyTorch from the rocm/pytorch Docker container. I’m starting the container by running:

docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 8G -v /home/username/dockerx:/dockerx -w /dockerx rocm/pytorch

Specifically I’m trying to run ExLlamaV2, but the problem occurs even when running simple tools like test-pytorch-gpu. Various function calls fail with a runtime error, and always with the same frustratingly unhelpful message:

HIP error: the operation cannot be performed in the present state

I’ve tried this on multiple distros, including a fresh Ubuntu 23.04 install with just the ROCm kernel modules and Docker installed, both loading correctly. These systems run as Libvirt/QEMU guests on KVM, with the GPU accessible through PCIe passthrough. My GPU is an RX 6800 XT, which according to ROCm’s documentation isn’t officially supported. ROCm seems to load normally in spite of this, though possibly with missing functionality. Probably not helping things: I’m working on a laptop with no internal GPU, and the 6800 XT is connected via a Thunderbolt 4 eGPU case. I just need this working for testing purposes, so I’m open to suggestions even if they would sacrifice performance. At the very least I’m hoping for some help debugging this to find out exactly what the problem is. If it simply isn’t possible to run this on my setup at all, I would like to confirm that.
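
To narrow down which call fails first, a check along these lines could be run inside the container. This is only a minimal sketch: run_steps, device_node_access and gpu_steps are hypothetical helper names (not part of ROCm or test-pytorch-gpu), and it assumes torch is importable from the Python in the rocm/pytorch image. It stops at the first failing step, so the last line printed shows where things break.

```python
import os


def run_steps(steps):
    """Run (name, callable) pairs in order and stop at the first failure."""
    results = []
    for name, fn in steps:
        try:
            results.append((name, True, str(fn())))
        except Exception as exc:
            # Record the exception type and message, then stop.
            results.append((name, False, f"{type(exc).__name__}: {exc}"))
            break
    return results


def device_node_access(path):
    """Report whether a device node exists and is readable/writable."""
    if not os.path.exists(path):
        raise FileNotFoundError(path)
    return f"read={os.access(path, os.R_OK)} write={os.access(path, os.W_OK)}"


def gpu_steps():
    """Build the list of checks; torch is imported lazily inside each step."""

    def import_torch():
        import torch
        return f"torch {torch.__version__}, HIP {torch.version.hip}"

    def device_visible():
        import torch
        if not torch.cuda.is_available():
            raise RuntimeError("torch.cuda.is_available() returned False")
        return torch.cuda.get_device_name(0)

    def allocate():
        import torch
        x = torch.ones(4, device="cuda")
        torch.cuda.synchronize()  # surface asynchronous errors here
        return tuple(x.shape)

    def matmul():
        import torch
        a = torch.rand(64, 64, device="cuda")
        return (a @ a).sum().item()

    return [
        ("/dev/kfd", lambda: device_node_access("/dev/kfd")),
        ("/dev/dri", lambda: device_node_access("/dev/dri")),
        ("import torch", import_torch),
        ("device visible", device_visible),
        ("allocate tensor", allocate),
        ("matmul", matmul),
    ]


if __name__ == "__main__":
    for name, ok, detail in run_steps(gpu_steps()):
        print(f"[{'ok' if ok else 'FAIL'}] {name}: {detail}")
```

If the device-node steps pass but "allocate tensor" fails, that would point at the GPU runtime rather than at the Docker device flags.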