I am very much in distress because I have been searching forums and GitHub issues for a solution but still haven't come across one. I am using a machine with 2x RTX 2080 Ti. However, when I run my code for multi-GPU training using nn.DataParallel, it hangs on the loss.backward() line. It runs fine on a single GPU, though, without any issues. The loss function is custom and works as published on the author's GitHub page.
I have another codebase (for different work) that runs on both multi-GPU and single-GPU without issues. The difference between the two is that the one with problems on multi-GPU uses a custom focal loss nn.Module, whereas the one that works in both settings is based on nn.BCELoss().
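For context, the custom loss is structured roughly like the generic binary focal loss sketch below. This is a simplified stand-in, not the author's exact code; the `alpha` and `gamma` values are just illustrative defaults:

```python
import torch
import torch.nn as nn

class FocalLoss(nn.Module):
    """Generic binary focal loss sketch: FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t)."""

    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, logits, targets):
        # Per-element BCE so we can weight each term before reducing.
        bce = nn.functional.binary_cross_entropy_with_logits(
            logits, targets, reduction="none"
        )
        p_t = torch.exp(-bce)  # probability assigned to the true class
        loss = self.alpha * (1.0 - p_t) ** self.gamma * bce
        return loss.mean()
```

Under nn.DataParallel this module gets replicated across both GPUs and the per-replica losses are gathered before backward, which is where my run appears to stall.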
May I know what's going on? What should I look into to diagnose this problem? I am really clueless and quite anxious because deadlines are coming up.