The toy example is kept brief.
Following your example, I reproduced it in my code. However, there is a weird bug. Do you know what causes this?
Traceback (most recent call last):
  File "train.py", line 363, in <module>
    main(args)
  File "train.py", line 97, in main
    train(args, trainer, task, epoch_itr)
  File "train.py", line 135, in train
    log_output = trainer.train_step(sample, update_params=True)
  File "/data/mmyin/tf-datapallelism/fairseq/trainer.py", line 120, in train_step
    loss, sample_size, logging_output, oom_fwd = self._forward(sample)
  File "/data/mmyin/tf-datapallelism/fairseq/trainer.py", line 212, in _forward
    oss, sample_size, logging_output_ = self.full_model(sample)
  File "/home/mmyin/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mmyin/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 124, in forward
    return self.gather(outputs, self.output_device)
  File "/home/mmyin/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 136, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/mmyin/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
    return gather_map(outputs)
  File "/home/mmyin/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/mmyin/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: zip argument #1 must support iteration
That error means that zip requires iterable inputs, such as lists or tuples, so it is being given something that is not iterable. What exactly? I don’t know, since I don’t have your implementation.
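For illustration, here is a pure-Python sketch of one way this error can arise. `toy_gather_map` is a hypothetical, simplified stand-in for the recursion visible in the traceback, not the library's code, and the guess that one of the returned values is a plain number (such as a sample count) is only an assumption, since the actual implementation was not posted.

```python
def toy_gather_map(outputs):
    """Hypothetical, simplified stand-in for the recursive gather in the traceback."""
    out = outputs[0]
    if isinstance(out, list):
        # Lists play the role of tensors here: "concatenate" the per-GPU pieces.
        return [x for piece in outputs for x in piece]
    # Anything else is treated as a container and gathered element-wise.
    # A non-iterable leaf (e.g. an int) makes zip(*outputs) raise TypeError.
    return type(out)(map(toy_gather_map, zip(*outputs)))


# Two replicas, each returning a tuple of tensor-like values: gathering works.
ok = toy_gather_map([([0.5], [1.0]), ([0.7], [2.0])])

# Two replicas, each returning (loss, plain int): the int reaches zip and fails.
try:
    toy_gather_map([([0.5], 32), ([0.7], 32)])
    error = None
except TypeError as exc:
    error = exc
```

If this is indeed the cause, it may help to check each value returned by the wrapped module's forward and make sure it is a tensor, or a tuple, list, or dict of tensors.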
Thanks for your advice.
I have a question about the final loss. I found that the final loss is not the sum over the batch, but a list containing the mini-batch losses.
For example, a big batch is divided into four parts for computing on 4 GPUs, so loss, _ = model(gt, input) returns a list containing four partial losses.
Is that correct, and do I need to sum them up manually?
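As a minimal sketch of the manual reduction: DataParallel gathers the per-replica outputs along dimension 0, so a loss computed inside the wrapped module comes back with one entry per GPU and has to be reduced before calling backward. The tensor below is a made-up stand-in for that gathered output; whether to use mean() or sum() depends on how each replica reduced its own chunk, which is an assumption here.

```python
import torch

# Stand-in for what the wrapped model returns with 4 GPUs:
# one partial loss per replica (values are made up for illustration).
partial_losses = torch.tensor([0.9, 1.1, 1.0, 1.2], requires_grad=True)

# Assuming each replica already averaged over its own chunk,
# average once more across replicas to get a scalar loss.
loss = partial_losses.mean()

# backward() needs a scalar; each partial loss receives a gradient of 1/4.
loss.backward()
```

If the chunks can have different sizes, weighting each partial loss by its chunk size before summing would give the exact batch average.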
We had the same issue, in that we could only train with a much smaller batch size when parallelizing.
Using DistributedDataParallel in both model and loss got us much better results. You have to use DistributedSampler and init_process_group, but it’s all in this example: https://github.com/pytorch/examples/blob/master/imagenet/main.py
However, we have not seen massive improvements in speed, probably due to our slow dataloader/data transfer as our input size is quite large…
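The DistributedSampler part of the setup mentioned above can be sketched in isolation. Here `num_replicas` and `rank` are passed explicitly so the snippet runs in a single process without a process group; the dataset and the world size of 4 are made up for illustration. In a real script, each process would first call init_process_group and wrap the model in DistributedDataParallel, as in the linked example, and the sampler would then pick up its rank and world size from the process group.

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset with 8 samples (illustrative only).
dataset = TensorDataset(torch.arange(8))
world_size = 4  # hypothetical number of processes, one per GPU

# Build the sampler each rank would use and record which indices it yields.
shards = {
    rank: list(
        DistributedSampler(
            dataset, num_replicas=world_size, rank=rank, shuffle=False
        )
    )
    for rank in range(world_size)
}

for rank, indices in shards.items():
    print(f"rank {rank} reads samples {indices}")
```

Each process therefore loads only its own slice of the data, which is why the per-process batch size, not the global one, is what goes into the DataLoader.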
With either method, DistributedDataParallel or DataParallel, running on an AWS P3 with 8 GPUs barely improved training time compared to a single GPU (perhaps the variation in the time required for an epoch is reduced, but the average time is about the same). That doesn’t make much sense. Has anyone seen the same problem?
Hi, I found your toy code solution for the DataParallel problem. Your work is fantastic.
But when I imitated it in my own code, things went wrong.
It gave me RuntimeError: all tensors must be on devices[0]
Here is my Model_with_parallel:
I would really appreciate your help!
I have tried applying DataParallel to my model and loss partially, and the code runs. But GPU-Util on the other GPUs, apart from device1, is still almost zero.
Thank you very much!!
I’ve had the same imbalance problem due to a very complex regularization method in a CNN. The method in this post is somewhat complicated and requires a lot of code changes. If you are also using a layer-wise regularization method, you can try gradient accumulation as a quick fix. The idea is to calculate the regularizer gradient of a single layer at a time, and let the graph be freed before calculating the next one.
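A minimal sketch of that idea, under stated assumptions: `layer_regularizer` is a hypothetical placeholder (a plain L2 penalty) standing in for the actual regularizer, and the small model and random data are made up. Each backward() call adds into the existing .grad buffers, so the regularizer gradients can be accumulated one layer at a time while each small graph is freed right after use.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def layer_regularizer(layer):
    # Hypothetical placeholder: a plain L2 penalty stands in for the real regularizer.
    return 1e-3 * layer.weight.pow(2).sum()


def backward_with_layerwise_regularizer(model, inputs, targets):
    # Gradients of the data term first.
    loss = F.mse_loss(model(inputs), targets)
    loss.backward()
    # Then one layer at a time: each call builds a small graph, accumulates
    # its gradient into .grad, and frees that graph before the next layer.
    for layer in model.modules():
        if isinstance(layer, nn.Linear):
            layer_regularizer(layer).backward()
    return loss.detach()


torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
inputs, targets = torch.randn(16, 4), torch.randn(16, 2)

model.zero_grad()
data_loss = backward_with_layerwise_regularizer(model, inputs, targets)
# An optimizer step would follow here, using the accumulated gradients.
```

The resulting gradients match those of a single backward pass over the summed loss; only the peak memory held by the regularizer graphs changes.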
I am having the same imbalance issue, but the problem is that my GPU 1, not GPU 0, is running out of memory. Both GPUs have 32GB of memory. With NVIDIA-SMI I see that GPU 0 is only using 6GB of memory, whereas GPU 1 goes up to 32GB.
I could have understood it if it were the other way around, with GPU 0 running out of memory, but this is weird.
I only pass my model to DataParallel, so it’s using the default values.
Also, if I use only 1 GPU, I don’t get any out-of-memory issues. This is also strange to me.
Any help would be appreciated.
P.S. I was getting a warning about RNN parameters not being in contiguous memory, so I added the flatten_parameters() call to the forward of the LSTM as well:
def forward(self, inputs, mode='train'):
    packed = tn.pack_sequence(inputs, enforce_sorted=False)
    # Pass the device of packed so that packed and self.hidden are on the same device,
    # since self.hidden is created on every call and I'm using multiple GPUs.
    self.hidden = self.init_hidden(len(inputs), packed.data.device)
    self.lstm.flatten_parameters()
    if mode == 'eval' or mode == 'test':
        with torch.no_grad():
            packed_out, self.hidden = self.lstm(packed, self.hidden)
    else:
        packed_out, self.hidden = self.lstm(packed, self.hidden)
    outputs, lens = tn.pad_packed_sequence(packed_out, batch_first=True)
    return outputs