MultiGPU tutorial broken?

The code was copied from here: https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html

No modifications were made. According to the page, it is supposed to print this:

# on 2 GPUs
Let's use 2 GPUs!
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
    In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])

However, it hangs after printing "Let's use 2 GPUs!".
The code runs up to the second-to-last line of the tutorial, and I can see any print statements I put before it:
output = model(input)
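For reference, this is the tutorial code as I'm running it, condensed (from memory, so minor details may differ from the linked page; the sizes are the tutorial's defaults):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset


class RandomDataset(Dataset):
    """Random tensors standing in for real data, as in the tutorial."""

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class Model(nn.Module):
    """Trivial linear model that prints its per-replica batch size."""

    def __init__(self, input_size, output_size):
        super().__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, x):
        output = self.fc(x)
        print("\tIn Model: input size", x.size(), "output size", output.size())
        return output


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Model(5, 2)
if torch.cuda.device_count() > 1:
    # DataParallel splits each batch across the visible GPUs.
    model = nn.DataParallel(model)
model.to(device)

loader = DataLoader(RandomDataset(5, 100), batch_size=30)
for data in loader:
    inputs = data.to(device)
    output = model(inputs)  # <- the line that hangs on 2 GPUs
    print("Outside: input size", inputs.size(), "output_size", output.size())
```

On a single-GPU or CPU-only machine the DataParallel wrapper is skipped entirely, which is why the same script runs fine elsewhere.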

Debugging leads to
File "/usr/lib/python3.6/threading.py", line 1072, in _wait_for_tstate_lock

I suspect it’s a race condition, since while debugging I saw one such message:
In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])

The machine has two GPUs (1080 Ti), and TensorFlow runs reasonably smoothly on it. I understand that the PyTorch parallelism API is still in active development and not fully tested (according to the documentation itself).

Still, I’m hoping someone with insight can share any information that would help me debug the problem further without having to learn the whole source.
I have a fairly good grasp of both Python and the underlying C/C++ programming, so don’t hold back anything that could help in debugging.

Thank you

Which PyTorch version are you using?
I assume you are using the default DataLoader from the tutorial, i.e. num_workers=0.
Could you execute the code line by line in a Python shell and see if it’s the creation of DataParallel that hangs, or the training loop using the DataLoader?
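If it does turn out to be the forward pass, Python's built-in faulthandler module can dump the stack of every thread from inside the process, which usually shows what each worker is blocked on without attaching gdb. A small sketch (the 60-second watchdog timeout is arbitrary):

```python
import faulthandler
import sys

# Arm a watchdog: if the process is still alive (i.e. presumably hung)
# after 60 seconds, dump every thread's stack to stderr instead of
# exiting. repeat=False (the default) means it fires at most once.
faulthandler.dump_traceback_later(60, exit=False)

# ... run the suspected-hanging code here, e.g. output = model(inputs) ...

# Disarm the watchdog once the code completes normally.
faulthandler.cancel_dump_traceback_later()

# You can also dump all thread stacks immediately at any point:
faulthandler.dump_traceback(file=sys.stderr)
```

The dump shows the file and line each thread is currently executing, so a stuck worker thread stands out right away.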


It’s the forward pass, namely the last line, which feeds the input to the model. I’ve gotten as far as parallel_apply, which does not return.

py(124)parallel_apply()
-> return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
(Pdb) 
	In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])

Going even deeper, I can trace it down to thread.start() inside parallel_apply, and further down:

> /usr/lib/python3.6/threading.py(828)start()
-> def start(self):
(Pdb) n
> /usr/lib/python3.6/threading.py(838)start()
-> if not self._initialized:
(Pdb) 
> /usr/lib/python3.6/threading.py(841)start()
-> if self._started.is_set():
(Pdb) 
> /usr/lib/python3.6/threading.py(843)start()
-> with _active_limbo_lock:
(Pdb) 
> /usr/lib/python3.6/threading.py(844)start()
-> _limbo[self] = self
(Pdb) 
> /usr/lib/python3.6/threading.py(845)start()
-> try:
(Pdb) 
> /usr/lib/python3.6/threading.py(846)start()
-> _start_new_thread(self._bootstrap, ())
(Pdb) s
> /usr/lib/python3.6/threading.py(851)start()
-> self._started.wait()
(Pdb) 	In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])

At this point, it hangs without stepping in. I’m guessing this is mapped to a pthread call. I can fire up gdb and see where it leads, but that would be counterproductive if this is a known issue and/or I’m just doing something silly.
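For reference, the handshake that start() is performing at that point is essentially the following: the parent thread blocks on an Event until the newly spawned OS thread runs its bootstrap and sets it. A minimal sketch of the pattern (not PyTorch code, just the shape of CPython's threading.py):

```python
import _thread
import threading


def start_with_handshake(target):
    """Mimic the started-event handshake in threading.Thread.start().

    The parent blocks until the child OS thread actually begins
    executing its bootstrap. If the child never gets that far, the
    parent hangs on started.wait() forever -- matching the traceback
    ending at threading.py line 851 above.
    """
    started = threading.Event()

    def bootstrap():
        started.set()  # child announces it is alive
        target()

    _thread.start_new_thread(bootstrap, ())
    started.wait()     # parent blocks here until the child sets the event
```

So if `_started.wait()` never returns, the new thread never even reached its bootstrap, which would point at something blocking thread creation itself (e.g. a lock held inside the CUDA runtime) rather than at the model code.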