Cannot use multiple GPUs with Titan X GPUs

We have compute nodes equipped with K40, Titan X, and Titan X (Pascal) GPUs. Our program runs normally with multiple GPUs on the nodes with Titan X (Pascal) and K40 GPUs, but it hangs on the nodes with Titan X GPUs. With help from our administrator, we found that when we use multiple GPUs on a Titan X node, the program makes the system call `futex(0x33ee6d60, FUTEX_WAIT_PRIVATE, 0, NULL)`, which blocks the whole program.

We tested the nodes with this script (https://github.com/pytorch/examples/blob/master/mnist/main.py), modified only by wrapping the model in DataParallel on line 70.
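For reference, the modification is roughly the following sketch (the `nn.Linear` model here is a hypothetical stand-in for the MNIST `Net` in the example script). `nn.DataParallel` splits each input batch across all visible GPUs and gathers the outputs; this is the multi-GPU path that triggers the hang on the Titan X nodes:

```python
import torch
import torch.nn as nn

# Hypothetical minimal model standing in for the Net class in main.py.
model = nn.Linear(10, 2)

# Wrapping the model in DataParallel (as done on line 70 of main.py)
# scatters each batch across all visible GPUs and gathers the results;
# with no GPUs visible it falls back to running the wrapped module directly.
model = nn.DataParallel(model)

x = torch.randn(4, 10)
out = model(x)
print(tuple(out.shape))  # (4, 2): batch size and output features are unchanged
```

Restricting the run to a single GPU (e.g. `CUDA_VISIBLE_DEVICES=0`) makes DataParallel a no-op, which is a quick way to confirm the hang only occurs on the multi-GPU path.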

NVIDIA's @ngimel has investigated this problem, and the hang might not be related to PyTorch. She has written a detailed comment here on diagnosing the issue and working around it:

Please have a look and see if it applies to you.

Thanks! I have asked our system administrators to check whether this fixes the issue.