My Windows PC is crashing while doing some Llama 2 training, and I’m really struggling to find a solution. This training example will reproduce it… It starts fine but then crashes the PC about an hour into training.
The crash dump files all mention nvlddmkm.sys. I have the latest driver but have tried some previous versions too. I’ve also tried reducing the batch size to 2 and to 1. I’ve posted on the NVIDIA board since I’m crashing the PC, but no luck so far.
I’m not a PyTorch expert, so if you have suggestions on parameters or other modules to test, that would be a big help. In the meantime, I’ll give it a try on one of my Linux machines.
I hope you are right. Even if I am trying to use too much memory, the OS and drivers should kick back an error, not crash.
I ran the code on my other machine (Win11, 32 GB RAM, RTX 2070 24 GB) and got the same crash. So at least I have eliminated hardware. If I run the example code as is, it gives a memory error. Changing the batch size from 4 to 1 lets me get further, but it then crashes both PCs.
I’m seeing something similar: not a crash with a dump, but an instant Windows reboot when training a model with torch 2.2.2+cu121 and CUDA 12.4. Do I need to downgrade to CUDA 12.1 or upgrade torch somehow?
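In case it helps anyone comparing versions: as far as I know, the prebuilt PyTorch wheels bundle their own CUDA runtime, so a `+cu121` build should normally run alongside a newer system toolkit as long as the driver is recent enough. Here is a rough sketch for checking what a given install was built against. The helper name `cuda_build_tag` is made up for this post, and the tag parsing assumes the usual `+cuXYZ` suffix.

```python
import re


def cuda_build_tag(torch_version: str):
    """Return the CUDA version encoded in a PyTorch version string.

    '2.2.2+cu121' -> '12.1'; returns None for CPU-only or untagged builds.
    Hypothetical helper, assuming the usual '+cuXYZ' local version suffix.
    """
    match = re.search(r"\+cu(\d+)$", torch_version)
    if match is None:
        return None
    digits = match.group(1)
    # The tag is the major version followed by one minor digit: cu118 -> 11.8
    return f"{digits[:-1]}.{digits[-1]}"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print("torch version:     ", torch.__version__)
        print("wheel CUDA tag:    ", cuda_build_tag(torch.__version__))
        print("torch.version.cuda:", torch.version.cuda)
        print("GPU visible:       ", torch.cuda.is_available())
```

If the wheel tag and `torch.version.cuda` agree and the GPU is visible, the version mismatch is probably not what is rebooting the machine, though I can’t rule it out.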
Thanks. I don’t think I’ve got the patience for CPU mode. I’ll keep playing and let you know if I find a fix. Both upgrading PyTorch to a CUDA 12.4 build (by building locally from source) and persuading CUDA to downgrade look painful…