torch.ger never returns in 1.1.0

jts · May 28, 2019, 3:24pm

I’m playing with the Q&A code from Facebook Research and had it working perfectly. It was running with torch 0.4.0 on a GTX 1080 with CUDA 9.0 and Python 3.6.7. I got a new machine with two RTX 2080 Ti cards and tried updating everything to the latest: torch 1.1.0, CUDA 10.2, and the latest NVIDIA driver. The app hangs at the torch.ger call. I tried running on the CPU only (no CUDA) and got the same result. It goes into the function and never returns.

I downgraded torch to 1.0.1.post2 and it works fine on the CPU. I still need to get all the GPU stuff sorted out to fix the CUDNN_STATUS_EXECUTION_FAILED error, but for now I’d just like to figure out why the torch.ger function is not returning in 1.1.0.

It’s being passed two tensors of shape [452] and type torch.float32. When I compare the tensors under 1.0.1.post2 and 1.1.0 in a debugger, they are almost identical. A few values differ by a tiny amount, but nothing that should break it.
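To isolate the call from the rest of the app, something like the following sketch could be used. It runs torch.ger on two float32 tensors of shape [452] in a fresh interpreter and reports whether the call returned, failed, or hung. The names run_with_timeout and REPRO are made up for illustration, random values stand in for the real tensors, and the 30-second timeout is an arbitrary choice.

```python
import subprocess
import sys

# Hypothetical stand-in for the real call: random values replace the actual
# tensors from the app, which are not reproduced here.
REPRO = """
import torch
a = torch.rand(452, dtype=torch.float32)
b = torch.rand(452, dtype=torch.float32)
out = torch.ger(a, b)  # outer product of two 1-D tensors
print(torch.__version__, tuple(out.shape))
"""


def run_with_timeout(code, timeout=30.0):
    """Run `code` in a fresh interpreter and classify the outcome.

    Returns "returned" on a clean exit, "failed" on a non-zero exit code
    (for example if torch is not installed), and "hung" if the child did
    not finish within `timeout` seconds.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,  # Python 3.6 spelling of text=True
            timeout=timeout,          # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "hung"
    return "returned" if proc.returncode == 0 else "failed"


if __name__ == "__main__":
    print("torch.ger:", run_with_timeout(REPRO))
```

If this reports "returned" under 1.1.0 while the app still hangs, the difference would have to come from the surrounding process state rather than from the inputs themselves.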

One additional bit of information: the code uses the multiprocessing library. I read that the PyTorch wrapper around it is preferred, but I was unable to switch over because of some missing functions like Finalize. Just looking at the ger function, though, it doesn’t seem like multiprocessing is the problem.

Any ideas what I should be looking at?