First of all, I want to apologize as I’m not a programmer or developer, so I may not have a complete understanding at the same level as many of you here.
I’m using Anaconda + PyTorch to create guitar profiles through Neural Amp Modeler, which performs local training on this infrastructure.
I recently switched from an Intel 14900K to an AMD 9950X, and now I can’t complete the training, something I’ve never had issues with before.
The GPU remained the same: an RTX 4070.
I’ve already tried uninstalling Anaconda completely, reinstalling it, and reinstalling PyTorch, but no luck.
Does anyone have any ideas on how to solve this? I really need to get back to work on this project.
I’m linking two text files with the content of the Anaconda Prompt here: Dropbox
Could you check whether dmesg is reporting any Xid error codes?
Also, are you able to move the tensor from your example to the GPU? Without that step the verification is incomplete. For example, run:
import torch
x = torch.randn(16, 16, device="cuda")
print(x)
print(torch.matmul(x, x))
and make sure the output shows a valid tensor without raising any errors.
If so, run dmesg -T | grep -i xid in your terminal next, check whether any Xid errors are reported, and see whether their timestamps correlate with the failures in your script.
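If it helps, the GPU check above could be wrapped in a small script that reports a status instead of only printing tensors. This is just a sketch: the function name check_gpu_matmul and the status strings are made up for illustration, and the tolerance is deliberately loose. It compares the GPU result against the same product computed on the CPU.

```python
def check_gpu_matmul(size=16):
    """Return a short status string describing whether a CUDA matmul works.

    The function name and status strings are hypothetical, for illustration only.
    """
    try:
        import torch
    except ImportError:
        return "no-torch"

    if not torch.cuda.is_available():
        return "no-cuda"

    try:
        x = torch.randn(size, size, device="cuda")
        y = torch.matmul(x, x)
        # CUDA calls are asynchronous; synchronize so errors surface here.
        torch.cuda.synchronize()
        # Compute the same product on the CPU as a reference.
        x_cpu = x.cpu()
        ref = torch.matmul(x_cpu, x_cpu)
    except RuntimeError as exc:
        return f"error: {exc}"

    if not torch.isfinite(y).all():
        return "non-finite"
    # Loose tolerance: GPU and CPU results differ slightly in floating point.
    if not torch.allclose(y.cpu(), ref, rtol=1e-2, atol=1e-2):
        return "mismatch"
    return "ok"


if __name__ == "__main__":
    print(check_gpu_matmul())
```

Anything other than "ok" would be worth posting here together with the dmesg output.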
I couldn’t understand why I was having issues when the rest of the system was exactly the same. So, after some research, I noticed instabilities with XMP1 on the AMD platform and decided to do an A/B comparison of the training process with and without it. With XMP1 enabled, I had the issue I mentioned earlier; with it disabled, I was able to complete the training without problems!
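For anyone repeating this kind of A/B comparison, a crude consistency check can be run under each memory setting before starting a long training job. The sketch below is only an assumption about what might be useful, not something from the training tool itself: it hashes the same in-memory buffer several times and counts rounds where the digest changes. The function name repeated_hash_check and its parameters are hypothetical, and a clean result does not prove the memory is stable; it is no substitute for a dedicated memory tester.

```python
import hashlib
import os


def repeated_hash_check(size_mb=32, rounds=10):
    """Hash one buffer repeatedly and return the number of mismatching rounds.

    Hypothetical helper for illustration. On a stable system the digest of
    unchanged data never changes, so the expected result is 0.
    """
    data = os.urandom(size_mb * 1024 * 1024)
    reference = hashlib.sha256(data).hexdigest()

    mismatches = 0
    for _ in range(rounds):
        # The buffer is never modified, so any differing digest points to
        # corruption somewhere between memory and the CPU.
        if hashlib.sha256(data).hexdigest() != reference:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print("mismatching rounds:", repeated_hash_check())
```

A non-zero count with XMP enabled and zero with it disabled would point in the same direction as the training result above.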