Hi, thanks for your solution, but when I do this I get an error in my loss function:
buffer[torch.eq(target, -1.)] = 0
RuntimeError: invalid argument 2: sizes do not match at /opt/conda/conda-bld/pytorch_1512946747676/work/torch/lib/THC/generated/…/generic/THCTensorMasked.cu:13
This is not an error in my code itself, but one that appears after enabling data parallelism (I ran a less intensive version of my code both with and without data parallelism, and it only throws this error with data parallelism).
My model is memory-intensive and I have 2 GPUs with 12206 MiB each. I just need to split my model across both GPUs during training as well as testing.
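As a side note on the error above: the masked assignment only works when the boolean mask and the indexed tensor have exactly the same shape, and `nn.DataParallel` splits the batch across replicas, which can break that assumption. A minimal sketch with hypothetical shapes (stand-ins, not your actual tensors):

```python
import torch

# Stand-ins for the real tensors: under nn.DataParallel the batch is
# split across replicas, so `target` and `buffer` must have the same
# shape on every device for this masked assignment to be valid.
buffer = torch.randn(4, 3)
target = torch.full((4, 3), -1.0)
target[0, 0] = 1.0

mask = torch.eq(target, -1.0)    # same shape as `buffer`, so indexing works
assert mask.shape == buffer.shape
buffer[mask] = 0.0
```

If `target` has an extra (or missing) dimension compared to `buffer`, reshaping it (e.g. with `target.view_as(buffer)`) before building the mask avoids the size mismatch.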
Not necessarily, but it depends on your use case of course.
You can pass all model parameters to a single optimizer, define per-parameter options, or use different optimizers. The split into submodules does not limit your options, and you can handle the submodules like single layers.
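To make these options concrete, here is a minimal sketch with a hypothetical two-submodule model, showing both a single optimizer with per-parameter-group options and separate optimizers per submodule:

```python
import torch
import torch.nn as nn

# Hypothetical model split into two submodules.
model = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 2))
sub_a, sub_b = model[0], model[1]

# Option 1: a single optimizer with per-parameter-group options.
opt = torch.optim.SGD([
    {"params": sub_a.parameters(), "lr": 1e-2},
    {"params": sub_b.parameters(), "lr": 1e-3},
])

# Option 2: a separate optimizer per submodule.
opt_a = torch.optim.SGD(sub_a.parameters(), lr=1e-2)
opt_b = torch.optim.Adam(sub_b.parameters(), lr=1e-3)
```

With option 1 you call `opt.step()` once; with option 2 you call `opt_a.step()` and `opt_b.step()` after the backward pass.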
However, my question is not about the unbalanced GPU usage; it is about the GPU memory that is allocated.
I notice that when I split the whole model across 4 GPUs and run forward/backward, the first GPU uses much more memory than it should. For example, if the whole model uses 12 GB on a single GPU, after splitting it across four GPUs the first GPU uses 11 GB while the other three together use about 11 GB.
Is there an explanation of how GPU memory is allocated when using multiple GPUs for model parallelism?
I still don't know how to deal with this problem; I hope you can help me!
Thanks for the clarification. I now understand that you are sharding your model, i.e. some layers are on GPU0, the next ones on GPU1, etc. With this approach, the total memory usage across all GPUs is approx. 2x the memory usage of the same model on a single GPU with the same setup (e.g. the same batch size).
If I understood your issue correctly, could you post a (small) reproducible code snippet so that we could have a look?
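In the meantime, one way to see where the memory actually goes is to query PyTorch's per-device counters; a minimal sketch:

```python
import torch

# Print allocated vs. reserved memory for each visible CUDA device.
# "allocated" is memory held by live tensors; "reserved" includes the
# caching allocator's pool, so it is usually larger.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 1024**2
        reserved = torch.cuda.memory_reserved(i) / 1024**2
        print(f"cuda:{i}: allocated {alloc:.0f} MiB, reserved {reserved:.0f} MiB")
else:
    print("no CUDA devices available")
```

Calling this right after the forward pass and again after the backward pass shows how activations and gradients are distributed across the devices.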
Yeah, I just realized that I misunderstood @Ashwin_Raju, so thanks for pointing it out.
Model sharding or model parallelism refers to splitting the model across several GPUs, so the forward and backward passes go through all devices.
You could use it, e.g., if you have a huge model and multiple GPUs available.
A simple dummy case is given in this previous post.
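To illustrate the idea, here is a toy sketch (hypothetical layer sizes) where each stage lives on its own device and the intermediate activation is moved between them; it falls back to CPU so it also runs without two GPUs:

```python
import torch
import torch.nn as nn

class ShardedModel(nn.Module):
    """Toy model sharding: each stage lives on its own device."""

    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.part1 = nn.Linear(10, 10).to(dev0)
        self.part2 = nn.Linear(10, 2).to(dev1)

    def forward(self, x):
        x = torch.relu(self.part1(x.to(self.dev0)))
        # Move the intermediate activation to the second device.
        return self.part2(x.to(self.dev1))

# Use two GPUs if available, otherwise fall back to CPU for both stages.
two_gpus = torch.cuda.device_count() >= 2
dev0 = "cuda:0" if two_gpus else "cpu"
dev1 = "cuda:1" if two_gpus else "cpu"

model = ShardedModel(dev0, dev1)
out = model(torch.randn(8, 10))
```

Autograd handles the cross-device transfers, so `out.sum().backward()` works as usual and the gradients end up on the same devices as their parameters.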
I wonder how everything you discussed regarding model sharding and data parallelism is affected when using NVLink (for example, connecting two GPUs with an NVLink bridge). Will it just make things faster, or will it open up new scenarios?