I am currently facing a problem in my project.
The learning process (using CUDA) crashes randomly, most of the time without any error message. It can happen after 1 minute or after 35 minutes.
It only happens if I use both of my GPUs (a mismatched 1080Ti and 2080Ti).
If I use only one, everything is fine.
We had a power problem (insufficient power supply), but it is now solved and the problem is still here.
Please note that, on rare occasions, a message about memory access appears during the crash; I am not sure it is related. It says that an instruction referenced memory that could not be written, always at an address that is either very small (0x00000…00020) or very large (0xFFFFFFFF…FF).
We also observed that the heavier the model, the sooner the crash happens.
What could cause such crashes?
Thanks for your help
Are you running the script in a terminal or a notebook?
The latter might hide the actual error message.
Also, how did you verify that your power supply issue is solved?
Are the GPUs running fine on their own?
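Since the crash usually leaves no message, one thing worth trying is to make the process as verbose as possible about where it dies. The sketch below is a minimal example, not a confirmed fix: it sets `CUDA_LAUNCH_BLOCKING=1` so that CUDA kernel launches run synchronously and errors surface closer to the failing call, and it enables Python's built-in `faulthandler` so that a Python traceback is dumped on a fatal error such as an access violation. The function name `enable_crash_diagnostics` and the log file name are hypothetical; the environment variable has to be set before CUDA is initialized, so the call is assumed to come before the first `import torch`.

```python
import faulthandler
import os


def enable_crash_diagnostics(log_path="crash_trace.log"):
    """Turn on diagnostics that can help when a CUDA process dies silently.

    Hypothetical helper; call it before CUDA is initialized.
    """
    # Make kernel launches synchronous so an error is reported near the
    # call that caused it (this slows training down; debugging only).
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    # Dump the Python traceback of all threads to a file on fatal errors
    # (segmentation fault, abort, Windows access violation, ...).
    # The file must stay open for as long as faulthandler is enabled.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Usage (assumed to sit at the very top of the training script):
#   log_file = enable_crash_diagnostics()
#   import torch  # imported only after the environment variable is set
```

If the process still dies without output, checking the log file after the crash shows which Python line was executing, which at least tells you which of the two models or devices was active.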
I run it in the Visual Studio Code terminal or in a separate terminal (same problem in both).
Concerning the power supply, we also had drops in CPU frequency; they are now gone.
Running the same code on a single GPU is fine (just slower).
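Because each card is stable alone and the crash only shows up when both are busy, a possible next check is to load both GPUs at the same time but from two independent processes, each seeing a single device. If that also crashes, the cause is more likely hardware, power, or driver related than something in the multi-GPU code path. This is a rough sketch under assumptions: `train.py` stands in for whatever single-GPU script is already known to work, and the helper names are hypothetical. `CUDA_VISIBLE_DEVICES` is the standard CUDA environment variable that restricts which devices a process can see.

```python
import os
import subprocess
import sys


def env_for_gpu(gpu_index):
    """Return a copy of the environment in which only one GPU is visible."""
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch_per_gpu(script, gpu_indices=(0, 1)):
    """Start one process per GPU at the same time and collect exit codes."""
    procs = [
        subprocess.Popen([sys.executable, script], env=env_for_gpu(i))
        for i in gpu_indices
    ]
    # Wait for every process; a non-zero exit code marks a crashed run.
    return [p.wait() for p in procs]


# Usage (hypothetical script name):
#   exit_codes = launch_per_gpu("train.py")
```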
What are the specs of your current PSU?
The new setup is dual 750 W.
Something worth knowing is that we do model sharing when using multiple GPUs. We run 2 architectures (same nature and complexity) on 2 sets of data. We do not use DataParallel.
Do you have another PSU with more power by chance?
The 1080Ti built into a workstation might need 600-650 W, while the 2080Ti might need an additional ~270+ W. I don’t know what your system setup looks like, so I just used public information from tomshardware etc.
Sadly we don’t, but we are convinced that it is not a problem.
According to the server company, the power supplies are not simply redundant but share the load.
That is reflected by the measurements we made on our side.
We measured a power draw of roughly 850 W at peak (which indeed matches your estimate).
That shows that power delivery is capable of more than a single unit (750 W), so with the 2 units we should have a capacity of 1500 W (more than enough).
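The argument above can be written down as simple arithmetic. The snippet below uses only the numbers quoted in this thread (two 750 W units, roughly 850 W measured at peak); the function name is hypothetical, and it assumes that the measured peak captures the real worst case.

```python
def power_headroom(unit_watts, units, peak_watts, load_sharing):
    """Return the spare capacity in watts (negative means overloaded)."""
    # With load sharing the units add up; with pure redundancy only one
    # unit's rating is available to the system at any time.
    capacity = unit_watts * units if load_sharing else unit_watts
    return capacity - peak_watts


shared = power_headroom(750, 2, 850, load_sharing=True)
redundant = power_headroom(750, 2, 850, load_sharing=False)

print(f"load sharing:    {shared:+d} W")
print(f"redundancy only: {redundant:+d} W")
```

With load sharing the headroom is 650 W; with pure redundancy the same draw would exceed a single unit by 100 W, which is why a sustained 850 W measurement supports the load-sharing claim.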
Thank you for your help!
@ptrblck Could it be an incompatibility between the 2 cards at the CUDA level?
I’m not aware of compatibility issues as long as the CUDA version is not too old for either device.
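To make "not too old for either device" concrete: the GTX 1080 Ti is a Pascal card with compute capability 6.1, and the RTX 2080 Ti is a Turing card with compute capability 7.5. Turing support arrived with CUDA 10.0, so that is the toolkit floor for this pair. The sketch below encodes just that check; the table and function names are hypothetical, and the table is deliberately limited to these two cards. On the actual machine, `torch.version.cuda` and `torch.cuda.get_device_capability(i)` report the values to compare.

```python
# Minimum CUDA toolkit version that supports each card's architecture.
# Hypothetical lookup table, limited to the two cards discussed here.
CARDS = {
    "GTX 1080 Ti": {"compute_capability": (6, 1), "min_cuda": (8, 0)},
    "RTX 2080 Ti": {"compute_capability": (7, 5), "min_cuda": (10, 0)},
}


def min_cuda_for(card_names):
    """Return the lowest CUDA version that supports every listed card."""
    return max(CARDS[name]["min_cuda"] for name in card_names)


def cuda_supports(cuda_version, card_names):
    """Check a version string such as "10.1" against the listed cards."""
    major, minor = (int(part) for part in cuda_version.split(".")[:2])
    return (major, minor) >= min_cuda_for(card_names)


both = ["GTX 1080 Ti", "RTX 2080 Ti"]
print(min_cuda_for(both))
print(cuda_supports("9.2", both), cuda_supports("10.1", both))
```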