I am trying to parallelize the following network: https://pastebin.com/raw/FgehWHw0
I am using DataParallel, as suggested on the forums: network = torch.nn.DataParallel(network, device_ids=args.gpus), where network is a network of type RNN Encoder and args.gpus is a list of available GPU device IDs.
Doing so, I keep running into the following error: https://pastebin.com/raw/0nfmmhkz
When I ran this on a single GPU (the default PyTorch setting), I got an out-of-memory error, which is why I resorted to parallelizing my code.
Is there a reasonable fix?
DataParallel splits the batch between different GPUs. How many GPUs are you using and what is your batch size?
Also, since you are running out of memory, I doubt DataParallel will help you, because it replicates the whole model on every device.
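To make the batch-splitting point concrete, here is a minimal sketch in plain Python of how a batch would be divided across devices. It assumes DataParallel scatters along the batch dimension with the same sizes torch.chunk would produce; chunk_sizes is a hypothetical helper written for this illustration, not part of PyTorch.

```python
def chunk_sizes(batch_size, num_devices):
    """Per-device batch sizes, assuming torch.chunk-style splitting."""
    per_device = -(-batch_size // num_devices)  # ceiling division
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(per_device, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


print(chunk_sizes(128, 3))  # [43, 43, 42]
```

Under this assumption each GPU sees a smaller batch, which reduces activation memory, but every GPU still holds a full copy of the model's parameters.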
Have you thought about model sharding?
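import torch.nn as nn

class MyModel(nn.Module):  # class name is illustrative
    def __init__(self, split_gpus):
        super(MyModel, self).__init__()
        self.large_submodule1 = ...  # should live on GPU 0
        self.large_submodule2 = ...  # should live on GPU 1
        self.split_gpus = split_gpus

    def forward(self, x):
        x = self.large_submodule1(x)
        if self.split_gpus:
            x = x.cuda(1)  # P2P GPU transfer
        return self.large_submodule2(x)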
Thanks for your reply. I have 6 GPUs (I was passing 3 of them as my device IDs to run this code), and my batch size is 128. I haven’t used model sharding. Will try it out. Thanks for the pointer!
I was trying out your solution of model sharding. However, I am having trouble calling loss.backward(), since my loss is the sum of the losses from the 2 decoders, which I am placing on different GPUs. Would you have pointers about a workaround for this issue?
My error trace specifically is:
Train epch 1, 0.00 s - (Done 1 of 73)
Traceback (most recent call last):
  File "main_interlingua.py", line 432, in <module>
    loss, norm_e, norm_d = train(args, train_batch, encoder, decoder, decoder2, encoder_optimizer, decoder_optimizer, decoder_optimizer2, criterion)
  File "main_interlingua.py", line 181, in train
  File "/usr0/home/spoddar2/anaconda/lib/python2.7/site-packages/torch/autograd/variable.py", line 167, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/usr0/home/spoddar2/anaconda/lib/python2.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1512378422383/work/torch/lib/THC/generated/../generic/THCTensorMathPointwise.cu:269
How did you calculate your loss? It should be something like loss = loss0 + loss1.cuda(0), so that both terms are on the same GPU before they are added.
Could you post the code snippet right before the error was thrown?
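For reference, the idea of moving one loss onto the other's device before summing can be sketched as follows. This is only an illustration under stated assumptions: the nn.Linear layers are hypothetical stand-ins for the encoder and the two decoders from the thread, the shapes and targets are made up, and it uses the torch.device / .to() API of recent PyTorch versions rather than the older Variable API shown in the traceback. It falls back to the CPU when fewer than two GPUs are visible.

```python
import torch
import torch.nn as nn

# Fall back to the CPU when fewer than two GPUs are visible, so the sketch still runs.
two_gpus = torch.cuda.is_available() and torch.cuda.device_count() >= 2
dev0 = torch.device("cuda:0" if two_gpus else "cpu")
dev1 = torch.device("cuda:1" if two_gpus else "cpu")

torch.manual_seed(0)
encoder = nn.Linear(8, 16).to(dev0)   # hypothetical stand-in for the RNN encoder
decoder = nn.Linear(16, 4).to(dev0)   # first decoder shares the encoder's device
decoder2 = nn.Linear(16, 4).to(dev1)  # second decoder lives on the other device
criterion = nn.MSELoss()

x = torch.randn(32, 8, device=dev0)
target = torch.randn(32, 4, device=dev0)
target2 = torch.randn(32, 4, device=dev1)

hidden = encoder(x)
loss0 = criterion(decoder(hidden), target)
# Copy the activations to decoder2's device; autograd tracks the transfer.
loss1 = criterion(decoder2(hidden.to(dev1)), target2)

# Bring both losses onto one device before adding them.
loss = loss0 + loss1.to(dev0)
loss.backward()
```

Because the device transfer is part of the autograd graph, the backward pass sends gradients back to decoder2 on its own device and to the shared encoder on the first one.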