Training hangs on multi-gpu during loss.backward()

Hi all,

I am very much in distress because I have been searching forums and GitHub issues for a solution but still haven’t come across any. I am using a machine with 2x RTX 2080 Ti. When I run my code multi-GPU using nn.DataParallel, it hangs on the loss.backward() line. It runs fine on a single GPU though, without any issues. The loss function is custom and works as published on the author’s GitHub page.

I have another set of code (from different work) that runs on both multi-GPU and single GPU without issues. The difference between these two sets of code is that the one with problems on multi-GPU uses a custom focal loss nn.Module, whereas the other, which works on both single and multi GPU, is based on nn.BCELoss().
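In case it helps, my setup looks roughly like this. This is a simplified sketch, not the actual custom loss (the FocalLoss below is just a generic binary focal loss for illustration), but the structure is the same:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Generic binary focal loss (illustrative stand-in for the author's custom loss)."""
    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, logits, targets):
        # per-element BCE, kept unreduced so it can be reweighted
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p = torch.sigmoid(logits)
        p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** self.gamma * bce).mean()

model = nn.Linear(8, 1)  # stand-in for my real network
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicate across both 2080 Tis
criterion = FocalLoss()

x = torch.randn(4, 8)
y = torch.randint(0, 2, (4, 1)).float()
loss = criterion(model(x), y)
loss.backward()  # <-- this is where it hangs on multi-GPU
```

On a single GPU (or CPU) this runs to completion; only the DataParallel path hangs.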

May I know what’s going on? What should I look into to diagnose this problem? I am really clueless and quite anxious with deadlines coming up.

Thank you.

Hi all,

I’ve tried running the original code, which uses the multiprocessing functions, on the multi-GPU setup and it works perfectly fine. In my modified version, I removed the multiprocessing (as I was originally on Windows, then shifted to Ubuntu).

May I know if there are any modules required when running on Ubuntu? I am very new to Ubuntu and clueless about where to start when programming deep learning models on it.

Thank you.