RuntimeError: CUDA error: unspecified launch failure // Neural Amp Modeler

giacomopas · September 17, 2024, 6:00pm

Hi everyone,

First of all, I want to apologize as I’m not a programmer or developer, so I may not have a complete understanding at the same level as many of you here.

I’m using Anaconda + PyTorch to create guitar profiles through Neural Amp Modeler, which performs local training on this infrastructure.

I recently switched from an Intel 14900K to an AMD 9950X, and now I can’t complete the training, something I’ve never had issues with before.
The GPU remained the same: an RTX 4070.

I’ve already tried uninstalling Anaconda completely, reinstalling it, and reinstalling PyTorch, but no luck.

Does anyone have any ideas on how to solve this? I really need to get back to work on this project.

I’m linking two text files with the content of the Anaconda Prompt here: Dropbox

Thanks so much for your help!

giacomopas · September 18, 2024, 7:37am

@ptrblck any help would be highly appreciated! Thanks so much!

giacomopas · September 18, 2024, 7:59am

I also did the verification you have here: Start Locally | PyTorch

Thanks to anyone who can help!

ptrblck · September 18, 2024, 12:23pm

Could you check if dmesg is reporting any Xid error codes?
Also, can you move the tensor from your example to the GPU? The verification is incomplete without that step.

giacomopas · September 18, 2024, 12:41pm

Thanks for your answer!
As I wrote, I’m really inexperienced with Anaconda/PyTorch.
Is there a guide on how to check what you asked?

Thanks so much…

ptrblck · September 24, 2024, 5:19pm

In your code, allocate a tensor on the GPU via:

import torch

x = torch.randn(16, 16, device="cuda")  # allocate the tensor directly on the GPU
print(x)
print(torch.matmul(x, x))  # run a matrix multiplication on the GPU

and make sure the output shows a valid tensor without raising any errors.
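For readers who want a pass/fail answer instead of inspecting the printed tensor by eye, the same check can be wrapped in a small function. This is only a sketch: the function name `gpu_sanity_check` and the tolerance are illustrative choices, not part of PyTorch or Neural Amp Modeler. It compares a GPU matrix multiplication against the same computation on the CPU.

```python
import torch


def gpu_sanity_check(size: int = 16) -> bool:
    """Return True if a small GPU matmul matches the CPU result.

    `gpu_sanity_check` is a hypothetical helper name used for illustration.
    """
    if not torch.cuda.is_available():
        print("CUDA is not available to PyTorch")
        return False

    x_cpu = torch.randn(size, size)
    x_gpu = x_cpu.to("cuda")            # copy the tensor to the GPU
    y_gpu = torch.matmul(x_gpu, x_gpu)  # run a kernel on the GPU
    torch.cuda.synchronize()            # surface asynchronous CUDA errors here

    y_cpu = torch.matmul(x_cpu, x_cpu)
    # Tolerance is an assumption; float32 results differ slightly across devices.
    return torch.allclose(y_gpu.cpu(), y_cpu, atol=1e-4)


if __name__ == "__main__":
    print("GPU check passed:", gpu_sanity_check())
```

A single small matmul passing does not prove the system is stable under a long training run; it only confirms that basic allocation, transfer, and kernel launch work.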

If that works, run dmesg -T | grep -i xid in your terminal and check whether any Xid errors are reported. Compare their timestamps with the times your script failed to see whether they correlate.
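If you would rather do this from Python, the following sketch mirrors the `dmesg -T | grep -i xid` command. It assumes a Linux system where `dmesg` is installed and readable by your user (it may need root on some distributions); the helper names `read_kernel_log` and `filter_xid_lines` are made up for this example.

```python
import subprocess


def filter_xid_lines(lines):
    """Keep only lines that mention "xid", ignoring case (like `grep -i xid`)."""
    return [line for line in lines if "xid" in line.lower()]


def read_kernel_log():
    """Return the output of `dmesg -T` as a list of lines.

    Returns an empty list if `dmesg` is not installed (for example on Windows).
    """
    try:
        result = subprocess.run(
            ["dmesg", "-T"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return []
    return result.stdout.splitlines()


if __name__ == "__main__":
    hits = filter_xid_lines(read_kernel_log())
    if hits:
        # Compare these timestamps with the time your training run failed.
        print("\n".join(hits))
    else:
        print("No Xid entries found in the kernel log.")
```

An empty result only means no matching lines were found in the kernel log that could be read; it does not rule out a hardware problem.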

giacomopas · September 29, 2024, 5:27pm

Just a quick update on what solved the issue, in case it’s useful for someone else.
I was running an XMP profile on my computer.

Disabling it fixed the issue without any other step.

ptrblck · September 29, 2024, 11:33pm

Thanks for sharing your solution! How did you figure out the issues were caused by XMP?

giacomopas · September 30, 2024, 12:51pm

I couldn’t understand why I was having issues when the rest of the system was exactly the same. After some research, I read about instabilities with XMP on the AMD platform and decided to do an A/B comparison of the training process with and without it. With XMP enabled, I had the issue I mentioned earlier; with it disabled, I was able to complete the training without problems!