[Solved] nn.DataParallel with ModuleList of custom modules fails on multiple GPUs


I am trying to wrap my modules via nn.DataParallel and I seem to run into the following error in a multi-GPU forward pass:

  File "tourquev5.py", line 744, in fit
    avg_loss = self.train_epoch(training_data,batch_size)
  File "tourquev5.py", line 780, in train_epoch
    x1,x2 = self.forward(training_batch,training=True)
  File "tourquev5.py", line 553, in forward
    candidate_embeddings = self.batch_process_reviews_without_RNN(candidate_ids)
  File "tourquev5.py", line 374, in batch_process_reviews_without_RNN
  File "/u/dcontrac/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/u/dcontrac/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 60, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/u/dcontrac/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 70, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/u/dcontrac/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
    raise output

RuntimeError: all tensors must be on devices[0]

Here’s the code snippet that’s actually causing the error:

for i in range(self.review_network_depth):
    # print "out.size", out.size(), reviews_outputs.size()
    with torch.cuda.device(self.GPUS[0]):
        print out.get_device(), reviews_outputs.get_device()  # self.review_network[i].get_device()
        out = self.review_network[i](self.mult_sent_emb_with_network_output(out_clone, outputs_clone))

And the method passed as the argument is:

def mult_sent_emb_with_network_output(self, out, review_outputs):
    # with torch.cuda.device(self.GPUS[0]):
    for batch in xrange(review_outputs.size(0)):
        for word_emb in xrange(review_outputs.size(1)):
            # with torch.cuda.device(self.GPUS[0]):
    print "exiting from mult_sent_emb_with_network_output on GPU ID:", review_outputs_clone.get_device()
    return review_outputs_clone

I have tried different ways of writing the code, wondering whether 1) clone was causing the problem, or 2) updating the tensor in a loop could have implications for DataParallel (which is why you see me using an array, outputs[i]).

What I noticed is that when execution reaches the forward pass of self.review_network[i], it creates n threads, where n is the number of GPUs I want to parallelize over. In the calling function in the snippet above, all my tensors reside on GPU ID 0, so the first thread finishes successfully. However, the second thread has everything on GPU ID 1 within the forward pass, and forcing a GPU using .cuda(devices=gpus[0]) or the CUDA context manager seems to have no effect. I don't know if this has something to do with the error, but this is what I see happen:

Here’s some output:

exiting from mult_sent_emb_with_network_output on GPU ID: 0
[0, 1, 2, 3]
ReviewSentenceEncoderLayer:input GPU ID: 0
ReviewSentenceEncoderLayer: updated input GPU ID: 0
[0, 1, 2, 3]
ReviewSentenceEncoderLayer:input GPU ID: 1
ReviewSentenceEncoderLayer: updated input GPU ID: 1
<class ‘torch.autograd.variable.Variable’>
Sent Embedding device_id: 0
ReviewSentenceEncoderLayer: mid GPU ID: 0
PositionwiseFeedForward: residual device: 0
PositionwiseFeedForward: output device: 0
layer_norm output device: 0
ReviewSentenceEncoderLayer: out.get_device 0

So ReviewSentenceEncoderLayer and PositionwiseFeedForward are the networks invoked during the forward pass and threaded by PyTorch. I guess that since I'm parallelizing, it is only natural for them to be on different GPUs, but I can't figure out what to do to make the error go away! I am guessing it's the second thread, the one on GPU ID 1, that's causing this error?
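One way to test that guess is to list which parameters and buffers of a layer are not on the expected device before it is wrapped and called. This is only a minimal sketch: it assumes a PyTorch release newer than the 0.2.0 used in this post (one where tensors expose `.device`), and `tensors_off_device` is a hypothetical helper name, not part of PyTorch.

```python
import torch
import torch.nn as nn


def tensors_off_device(module, device):
    """Return the sorted names of parameters and buffers not located on `device`.

    Pass an explicit index for GPUs (e.g. "cuda:0"), since "cuda" and "cuda:0"
    do not compare equal as torch.device objects.
    """
    device = torch.device(device)
    named = list(module.named_parameters()) + list(module.named_buffers())
    return sorted(name for name, tensor in named if tensor.device != device)


# Stand-in layer for illustration; a freshly built module lives on the CPU.
layer = nn.Linear(4, 4)
print(tensors_off_device(layer, "cpu"))     # []
print(tensors_off_device(layer, "cuda:0"))  # ['bias', 'weight']
```

Running the same check with the first entry of the device list on each element of self.review_network would show whether anything inside the wrapped layers was left off that GPU.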

I have seen (and tried) the solutions in: https://github.com/pytorch/pytorch/issues/1280, "Tensors are on different GPUS", "How to train this model on multi GPUs", https://github.com/pytorch/pytorch/issues/1150, and "How to change the default device of GPU? device_ids[0]".

Am I missing something? Any leads would be very helpful! Thanks!

[Edit] : Here’s how I’m initializing (in case this helps):
self.review_network = nn.ModuleList( [nn.DataParallel(helperLayers.ReviewSentenceEncoderLayer(hidden_size,dropout=0.1).type(self.dtype)) for i in range(self.review_network_depth)])

I have tried using device arguments, .cuda() with GPU arguments, etc., and none of that has helped either.

Also note, I can run the code on a single GPU without any changes.

Version details:
Pytorch Version: 0.2.0_3
Using CUDA, number of devices:2

It turns out the nn.Parameters in one of my networks weren't getting replicated across the GPUs. I am not sure why that is. I had to call cuda(gpu_id) on the two parameters, and then everything worked. No other fixes were necessary.
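The post does not say how those two parameters were defined, so the following is only one possible explanation, not a diagnosis of the code above. Module-level calls such as .cuda() and the replication done by nn.DataParallel only see parameters that are registered on the module; an nn.Parameter kept inside a plain Python list is not registered. The sketch below, written for a recent PyTorch and Python 3, uses two hypothetical layers (`PlainListLayer`, `ParameterListLayer`) and a hypothetical helper (`registered_names`) to show the difference.

```python
import torch
import torch.nn as nn


class PlainListLayer(nn.Module):
    """Hypothetical layer: one parameter is hidden inside a plain Python list."""

    def __init__(self, hidden_size):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(hidden_size))     # registered
        self.extra = [nn.Parameter(torch.zeros(hidden_size))]  # NOT registered


class ParameterListLayer(nn.Module):
    """Same layer, but the list is an nn.ParameterList, so it is registered."""

    def __init__(self, hidden_size):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(hidden_size))
        self.extra = nn.ParameterList([nn.Parameter(torch.zeros(hidden_size))])


def registered_names(module):
    """Names of the parameters the module actually knows about."""
    return sorted(name for name, _ in module.named_parameters())


print(registered_names(PlainListLayer(8)))      # ['scale']
print(registered_names(ParameterListLayer(8)))  # ['extra.0', 'scale']
```

If a parameter is missing from this listing, registering it (as a direct attribute or through nn.ParameterList) may be an alternative to moving it by hand with cuda(gpu_id).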
