The code was copied from here: https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
No modifications were made. According to the page, it is supposed to print this:
# on 2 GPUs
Let's use 2 GPUs!
In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])
However, it hangs after printing "Let's use 2 GPUs!".
The code does run up to the second-to-last line in the tutorial, and I can see any print statements I put there. The line it hangs on is:
output = model(input)
Debugging leads to
File "/usr/lib/python3.6/threading.py", line 1072, in _wait_for_tstate_lock
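From what I can tell, `_wait_for_tstate_lock` is just the internals of `Thread.join()` — the main thread waiting on a worker thread that never finishes. A minimal stdlib illustration of that call path (not the tutorial code; the worker here is a stand-in):

```python
import threading
import time

def worker():
    time.sleep(0.1)  # stands in for the per-GPU scatter/gather work

t = threading.Thread(target=worker)
t.start()
# join() blocks inside threading._wait_for_tstate_lock until worker()
# returns; in my case one of DataParallel's threads apparently never does.
t.join()
assert not t.is_alive()
```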
I suspect a race condition, because while debugging I did once see one such message:
In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
The machine has two GPUs (1080 Ti), and TensorFlow seems to work reasonably smoothly on it. I understand that the PyTorch parallelism API is still under active development and not fully tested (according to the documentation itself).
Still, I'm hoping someone with insight into the codebase can share pointers for debugging this further without my having to learn the whole source.
I have a fairly good grasp of both Python and the underlying C/C++, so don't hold back anything that could help with debugging.
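One thing I plan to try next time it hangs is a full thread dump via the stdlib `faulthandler` module, so I can see where every thread is stuck (the 30-second timeout below is an arbitrary value I picked, not anything from the tutorial):

```python
import sys
import faulthandler

# If the process is still running after 30 s, dump every thread's stack
# to stderr. Cancel the pending dump if the script finishes normally.
faulthandler.dump_traceback_later(timeout=30)

# ... tutorial code would run here ...

faulthandler.cancel_dump_traceback_later()

# An immediate, on-demand dump of all thread stacks also works:
faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
```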
Thank you