Using two DataParallel modules in one architecture

I am using 4 GPUs, say device0, device1, device2, and device3, and the model is set up as follows:

  • Inside the __init__ function:
sub_model1.to(device0)  # parameters of sub_model1 live on device0 (= device_ids[0])
sub_model1 = torch.nn.DataParallel(sub_model1, device_ids=[device0, device1])
sub_model2.to(device2)  # parameters of sub_model2 live on device2 (= device_ids[0])
sub_model2 = torch.nn.DataParallel(sub_model2, device_ids=[device2, device3])
  • Inside the forward function:
y = sub_model1(x)
# y = y.to(device2)  # we don't need this, since DataParallel scatters the input to its own devices anyway
out = sub_model2(y)
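
Putting the two pieces together, a stripped-down sketch of the module looks roughly like this (TwoStageModel and its constructor arguments are just placeholders for illustration; the real sub-models are more complex):

import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self, sub_model1, sub_model2, device0, device1, device2, device3):
        super().__init__()
        # each sub-model's parameters must live on the first device of its DataParallel group
        sub_model1.to(device0)
        self.sub_model1 = nn.DataParallel(sub_model1, device_ids=[device0, device1])
        sub_model2.to(device2)
        self.sub_model2 = nn.DataParallel(sub_model2, device_ids=[device2, device3])

    def forward(self, x):
        y = self.sub_model1(x)    # output is gathered on device0 (the default output_device)
        out = self.sub_model2(y)  # input is scattered again across device2/device3
        return out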

This gives me out-of-memory problems after running perfectly for some epochs. The error reads: RuntimeError: Caught RuntimeError in replica 1 on device 3.

Am I doing this correctly? Without DataParallel the model runs perfectly on 2 GPUs; with this DataParallel setup on 4 GPUs I simply double the batch_size.

The error message you quoted doesn't sound like an out-of-memory error, but like a generic RuntimeError.
I assume you are not seeing this error when using a single GPU?

Do you get any more information from the stack trace?

Hi, thanks for your reply.

This is the complete error:

2019-11-10 10:56:51,904 - Parser - Current learning rate: 0.002000
2019-11-10 10:56:59,308 - Parser - Epoch 0, Batch 1, AvgCost: 2.10, CorrectSpan: 0.49, CorrectNuclear: 0.37, CorrectRelation: 0.03 - 0 mins 7 secs
2019-11-10 10:57:00,668 - Parser - Epoch 0, Batch 2, AvgCost: 2.08, CorrectSpan: 0.51, CorrectNuclear: 0.36, CorrectRelation: 0.02 - 0 mins 8 secs
2019-11-10 10:57:03,495 - Parser - Epoch 0, Batch 3, AvgCost: 2.11, CorrectSpan: 0.51, CorrectNuclear: 0.35, CorrectRelation: 0.03 - 0 mins 11 secs
2019-11-10 10:57:04,270 - Parser - Epoch 0, Batch 4, AvgCost: 2.10, CorrectSpan: 0.52, CorrectNuclear: 0.36, CorrectRelation: 0.03 - 0 mins 12 secs
2019-11-10 10:57:04,866 - Parser - Epoch 0, Batch 5, AvgCost: 2.04, CorrectSpan: 0.52, CorrectNuclear: 0.34, CorrectRelation: 0.03 - 0 mins 12 secs
2019-11-10 10:57:09,912 - Parser - Epoch 0, Batch 6, AvgCost: 2.05, CorrectSpan: 0.53, CorrectNuclear: 0.35, CorrectRelation: 0.05 - 0 mins 18 secs
2019-11-10 10:57:12,131 - Parser - Epoch 0, Batch 7, AvgCost: 2.07, CorrectSpan: 0.52, CorrectNuclear: 0.37, CorrectRelation: 0.04 - 0 mins 20 secs
2019-11-10 10:57:12,906 - Parser - Epoch 0, Batch 8, AvgCost: 2.06, CorrectSpan: 0.53, CorrectNuclear: 0.37, CorrectRelation: 0.04 - 0 mins 21 secs
2019-11-10 10:57:13,351 - Parser - Epoch 0, Batch 9, AvgCost: 2.06, CorrectSpan: 0.53, CorrectNuclear: 0.37, CorrectRelation: 0.04 - 0 mins 21 secs
2019-11-10 10:57:13,651 - Parser - Epoch 0, Batch 10, AvgCost: 2.08, CorrectSpan: 0.53, CorrectNuclear: 0.37, CorrectRelation: 0.04 - 0 mins 21 secs
Traceback (most recent call last):
  File "train.py", line 319, in <module>
    main()
  File "train.py", line 238, in main
    cost, cost_val = network.loss(subset_data, gold_subtrees, epoch=epoch)
  File "/home/ffajri/Workspace/neural_project/models/architecture.py", line 497, in loss
    cost = self.decode(encoder_output, gold_nuclear, gold_relation, gold_segmentation, span, len_golds)
  File "/home/ffajri/Workspace/neural_project/models/architecture.py", line 381, in decode
    segment_output, transformer_output = self.run_transformer_segmentation(hidden_state1, segment_mask) #output in cuda-2
  File "/home/ffajri/Workspace/neural_project/models/architecture.py", line 98, in run_transformer_segmentation
    edus_score, edus_vec  = self.transformer_segmenter(segmented_encoder, segment_mask.int())
  File "/home/ffajri/anaconda3/envs/py3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ffajri/anaconda3/envs/py3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ffajri/anaconda3/envs/py3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ffajri/anaconda3/envs/py3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/ffajri/anaconda3/envs/py3/lib/python3.7/site-packages/torch/_utils.py", line 385, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 3.
Original Traceback (most recent call last):
  File "/home/ffajri/anaconda3/envs/py3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/ffajri/anaconda3/envs/py3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ffajri/Workspace/neural_project/modules/encoder.py", line 95, in forward
    x = self.transformer_inter[i](i, x, x, mask!=1)  # all_sents * max_tokens * dim
  File "/home/ffajri/anaconda3/envs/py3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ffajri/Workspace/neural_project/modules/encoder.py", line 68, in forward
    mask=mask)
  File "/home/ffajri/anaconda3/envs/py3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ffajri/Workspace/neural_project/modules/neural.py", line 410, in forward
    query = query / math.sqrt(dim_per_head)
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 3; 15.75 GiB total capacity; 11.64 GiB already allocated; 2.12 MiB free; 2.97 GiB cached)

Without DataParallel, the model needs at least two GPUs to fit, and with that setup it works perfectly.

For now, I think the problem is that I am calling sub_model2 inside a loop:

y = sub_model1(x)
while not_finished():
    y1 = pick_some_index(y)
    out = sub_model2(y1)  # a fresh DataParallel forward pass on every iteration

Based on https://erickguan.me/2019/pytorch-parallel-model, DataParallel copies the model to every device and splits the mini-batch across them on each forward call. My guess is that the replicas created on each device in every iteration of the loop are not freed (or not handled efficiently by PyTorch) between iterations (correct me if I'm wrong).
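
To check whether memory really piles up across the loop iterations, a rough diagnostic would be to print per-GPU memory before and after every sub_model2 call (log_gpu_memory is just a helper sketched here for illustration, and not_finished / pick_some_index are the placeholder names from the pseudocode above):

import torch

def log_gpu_memory(tag):
    # report allocated and cached memory for every visible GPU
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 1024**2
        cached = torch.cuda.memory_cached(i) / 1024**2  # memory_reserved() in newer PyTorch
        print(f"{tag} | cuda:{i} allocated={alloc:.1f} MiB, cached={cached:.1f} MiB")

y = sub_model1(x)
step = 0
while not_finished():
    y1 = pick_some_index(y)
    log_gpu_memory(f"before sub_model2, step {step}")
    out = sub_model2(y1)
    log_gpu_memory(f"after sub_model2, step {step}")
    step += 1

If the allocated memory on device2/device3 keeps growing from one step to the next, that would support the hypothesis above.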

I changed my training stage to get rid of these loops, and it now works with 2 * batch_size.
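
Concretely, the change amounts to collecting all of the selections first and running sub_model2 once on the stacked batch, instead of once per loop iteration. A minimal sketch of that idea, assuming not_finished() and pick_some_index(y) do not depend on the previous out and that the selections can be concatenated along the batch dimension:

import torch

y = sub_model1(x)

# collect every selection first instead of calling sub_model2 inside the loop
selections = []
while not_finished():
    selections.append(pick_some_index(y))

# a single DataParallel forward: sub_model2 is replicated once and the
# stacked batch is split across device2 and device3
batch = torch.cat(selections, dim=0)
out = sub_model2(batch)

This way the replicas (and the scattered inputs) are created once per training step rather than once per loop iteration.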