I've been trying to troubleshoot this for a while on my Fedora 40 / RX 6900 XT system.
I have torchtune built from the GitHub repo. I originally installed ROCm 6.0 from the official Fedora 40 repos, then uninstalled it and installed ROCm 6.1.2 from AMD's ROCm repo, following their documentation for RHEL 9.4. I originally had PyTorch 2.5 for ROCm 6.0, which I've since updated to the latest nightly of 2.5 for ROCm 6.1.
I still always get NaN loss when training. One of the torchtune devs gave me a recipe for training in fp16; that more than tripled my training speed (from 25 t/s to 79 t/s), but the loss still shows as NaN. All testing has been done on training a LoRA for Phi Mini with a small 10k-line dataset and a sequence length of 32 for testing purposes. Others have confirmed that both my bf16 and fp16 recipes train fine on NVIDIA machines without NaN in the training loss.
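To rule out a raw precision/kernel problem on the ROCm side independent of torchtune, I've been using a minimal standalone check along these lines (just a sketch, not part of the recipe):

```python
import torch

# Minimal sanity check: does a plain bf16/fp16 matmul + cross-entropy
# produce a finite loss on this GPU at all?
assert torch.cuda.is_available()  # ROCm builds expose the GPU via the cuda API
dev = torch.device("cuda")

for dtype in (torch.bfloat16, torch.float16):
    x = torch.randn(32, 4096, device=dev, dtype=dtype)
    w = torch.randn(4096, 4096, device=dev, dtype=dtype)
    logits = (x @ w).float()
    targets = torch.randint(0, 4096, (32,), device=dev)
    loss = torch.nn.functional.cross_entropy(logits, targets)
    print(dtype, "loss:", loss.item(), "finite:", torch.isfinite(loss).item())
```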
Side note: I also had an issue with a hipBLASLt error, which I worked around with `export TORCH_BLAS_PREFER_HIPBLASLT`
(see HIPBLASLT error, and the work around for AMD/ROCM users who are getting it · pytorch/torchtune · Discussion #1108 · GitHub for more details).
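For reference, this is roughly how I apply it in my test scripts (the `"0"` value is my assumption of what the linked discussion recommends, i.e. telling PyTorch's ROCm BLAS backend not to prefer hipBLASLt; check the thread for the exact setting):

```python
import os

# Set before torch is imported so the ROCm BLAS backend picks it up.
# The "0" value is assumed here; see the linked discussion for details.
os.environ.setdefault("TORCH_BLAS_PREFER_HIPBLASLT", "0")

import torch  # imported after the env var on purpose
```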
Update: the issue only happens with the creativegpt dataset by N8programs. When I use the Alpaca dataset, the loss shows up just fine. However, I've had others try this dataset with my same recipe/configs on their NVIDIA machines, and they don't see the issue.
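To narrow down where the NaN first appears when the creativegpt batches go through the model, I'm hooking the model roughly like this (a generic PyTorch sketch, nothing torchtune-specific; `model` is whatever module the recipe builds):

```python
import torch

def install_nan_hooks(model: torch.nn.Module) -> None:
    """Print the name of any module whose forward output contains a non-finite value."""
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                print(f"non-finite output in module: {name}")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))
```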