Training hangs on multi-gpu during loss.backward()

Hi all,

I am very much in distress because I have been searching forums and GitHub issues for a solution but still haven’t come across any. I am using a machine with 2x RTX 2080 Ti. When I run my code multi-GPU using nn.DataParallel, it hangs on the loss.backward() line. It runs fine on a single GPU though, without any issues. The loss function is custom and works as published on the author’s GitHub page.

I have another set of code (from different work) that runs on both multi-GPU and single GPU without issues. The difference between these two sets of code is that the one with problems on multi-GPU uses a custom focal loss nn.Module, whereas the other, which works on both single and multi GPU, is based on nn.BCELoss().
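In case it helps, my setup looks roughly like this. This is a simplified sketch, not the actual custom loss (the FocalLoss below is just a generic binary focal loss for illustration), but the structure is the same:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Generic binary focal loss (illustrative stand-in for the author's custom loss)."""
    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, logits, targets):
        # per-element BCE, kept unreduced so it can be reweighted
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p = torch.sigmoid(logits)
        p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** self.gamma * bce).mean()

model = nn.Linear(8, 1)  # stand-in for my real network
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicate across both 2080 Tis
criterion = FocalLoss()

x = torch.randn(4, 8)
y = torch.randint(0, 2, (4, 1)).float()
loss = criterion(model(x), y)
loss.backward()  # <-- this is where it hangs on multi-GPU
```

On a single GPU (or CPU) this runs to completion; only the DataParallel path hangs.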

May I know what’s going on? What should I look into to diagnose this problem? I am really clueless and quite anxious with deadlines coming up.

Thank you.

Hi all,

I’ve tried running the original code, which uses the multiprocessing functions, on the multi-GPU setup and it works perfectly fine. In my modified version, I removed the multiprocessing (as I was originally on Windows, then shifted to Ubuntu).

May I know if there are any modules required when running on Ubuntu? I am very new to Ubuntu and clueless about where to start when programming deep learning models on it.

Thank you.