Torch crashing NVIDIA driver with this simple example :-( Help

burhop · March 16, 2024, 4:37pm

My Windows PC is crashing while doing some Llama2 training and I’m really struggling to find a solution. This training example will reproduce it… It starts fine but then crashes the PC about an hour into training.

Windows 10, 32 GB
RTX 4080, 32 GB
Python 3.11.8
torch 2.2.1+cu121
torchaudio 2.2.1+cu121
torchvision 0.17.1

NVIDIA driver 551.61 (release date: 02/22/2024)
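
For anyone comparing setups, the Python-side versions listed above can be collected with a short script. This is only a minimal sketch: `collect_versions` is a hypothetical helper name, and it reports what is installed in the current Python environment. The driver version is not visible this way and still has to come from `nvidia-smi` or the NVIDIA control panel.

```python
import platform
from importlib import metadata


def collect_versions(packages=("torch", "torchaudio", "torchvision")):
    """Return a dict with the OS, Python, and installed package versions."""
    report = {
        "os": platform.platform(),
        "python": platform.python_version(),
    }
    for name in packages:
        try:
            # Reads the version from the installed package metadata,
            # without importing the package itself.
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Pasting this output into a report makes it easier to spot mismatches such as a `+cu121` torch next to a CPU-only torchvision.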

The crash dump files all mention nvlddmkm.sys. I have the latest driver but have tried some previous versions too. I’ve also tried reducing the batch size to 2 and to 1. I’ve posted on the NVIDIA board as well, since I’m crashing the PC, but no luck there.

I’m not a PyTorch expert, so if you have suggestions on parameters or other modules to test, that would be a big help. In the meantime, I’ll give it a try on one of my Linux machines.

ptrblck · March 16, 2024, 5:45pm

I doubt the issue is related to PyTorch, so you might want to follow up in the cross-post.

burhop · March 17, 2024, 3:22pm

I hope you are right. Even if I am trying to use too much memory, the OS and drivers should kick back an error, not crash.

I ran the code on my Win11 machine (32 GB) with an RTX 2070 (24 GB) and got the same crash. So at least I have eliminated hardware. If I run the example code as is, it gives a memory error. Changing the batch size from 4 to 1 lets me get further, but it still crashes both PCs.
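
Since memory errors keep coming up, it may help to confirm how much GPU memory PyTorch actually sees before training starts. The following is a rough sketch, not a fix: `gpu_memory_report` and `gib` are hypothetical helper names, and the function returns `None` when torch is missing or no CUDA device is visible.

```python
def gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / 1024**3


def gpu_memory_report():
    """Return name, free and total memory of the current CUDA device, or None."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    device = torch.cuda.current_device()
    # mem_get_info returns (free_bytes, total_bytes) as reported by the driver.
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return {
        "name": torch.cuda.get_device_name(device),
        "free_gib": round(gib(free_bytes), 2),
        "total_gib": round(gib(total_bytes), 2),
    }


if __name__ == "__main__":
    print(gpu_memory_report())
```

Running it once before training and once after reducing the batch size shows whether the job is anywhere near the card's limit.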

A Linux machine with a 3080 is next.

OracPrime · April 2, 2024, 9:26pm

I’m seeing something similar: not a crash with a dump, but an instant Windows reboot when training a model with torch 2.2.2+cu121 and CUDA 12.4. Do I need to downgrade to CUDA 12.1 or upgrade torch somehow?
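
On the version question: the `+cu121` suffix names the CUDA runtime the wheel was built against. To my understanding the pip wheels bundle their own CUDA runtime libraries, so a separately installed CUDA 12.4 toolkit is normally not what PyTorch uses, and only the driver has to be recent enough. A small sketch for reading that suffix (`bundled_cuda_version` is a hypothetical helper name):

```python
import re


def bundled_cuda_version(torch_version):
    """Map a version string such as '2.2.2+cu121' to '12.1'.

    Returns None when the string has no '+cuNNN' suffix (for example a
    CPU-only build tagged '+cpu').
    """
    match = re.search(r"\+cu(\d{2,})$", torch_version)
    if match is None:
        return None
    digits = match.group(1)
    # The last digit is the minor version, everything before it the major.
    return f"{digits[:-1]}.{digits[-1]}"


if __name__ == "__main__":
    print(bundled_cuda_version("2.2.2+cu121"))
```

In a live session, `torch.version.cuda` reports the same information directly.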

burhop · April 3, 2024, 2:11pm

I did upgrade to the NVIDIA drivers that just came out but no joy.

Setting the PyTorch device to “cpu” allows it to work (if you have lots of time).
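
A minimal sketch of how that switch could be made without editing the training script each time. The `FORCE_CPU` environment variable and `choose_device` are made-up names for illustration, not PyTorch features:

```python
import os


def choose_device(cuda_available, env=None):
    """Return 'cpu' or 'cuda' for use with torch.device().

    FORCE_CPU=1 (a hypothetical variable) overrides GPU use even when
    CUDA is available.
    """
    env = os.environ if env is None else env
    if env.get("FORCE_CPU") == "1":
        return "cpu"
    return "cuda" if cuda_available else "cpu"


# Typical use inside a training script (assumes torch is installed):
#   import torch
#   device = torch.device(choose_device(torch.cuda.is_available()))
#   model.to(device)
```

The model and every input batch have to be moved to the same device, otherwise PyTorch raises a device mismatch error.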

Running it on the Linux machine didn’t get far: I immediately got an error message saying it can’t access enough GPU memory.

OracPrime · April 3, 2024, 2:13pm

Thanks. I don’t think I’ve got the patience for CPU mode. I’ll keep playing and let you know if I find a fix. Both moving PyTorch to CUDA 12.4 (by building locally from source) and downgrading CUDA look painful…