Splitting a convolution across two GPUs runs out of memory during training

Hi, I am new to PyTorch and I have a problem. In a deep learning network, I split a convolution across two GPUs, but GPU memory consumption keeps increasing during training, and eventually it fails with "RuntimeError: CUDA error: out of memory".

I tried using del loss and torch.cuda.empty_cache(), but it does not help… Could you please help me?

Here is the code snippet for the conv part:

import torch
import torch.nn as nn

class Conv2dBlock_Multi(nn.Module):
    def __init__(self, input_dim, output_dim, kernel_size, stride,
                 padding=0, norm='none', activation='relu', pad_type='zero'):
        super(Conv2dBlock_Multi, self).__init__()
        # zero padding applied before the split convolution (pad_type='zero')
        self.pad = nn.ZeroPad2d(padding)
        self.conv = Conv2d_Multi(input_dim, output_dim, kernel_size, stride, bias=True)

    def forward(self, x):
        x = self.conv(self.pad(x))
        return x

class Conv2d_Multi(nn.Module):
    def __init__(self, input_dim, output_dim, kernel_size, stride, bias=True):
        super(Conv2d_Multi, self).__init__()
        # split the convolution channel-wise into two halves
        self.conv1 = nn.Conv2d(input_dim // 2, output_dim // 2, kernel_size, stride, bias=True)
        self.conv2 = nn.Conv2d(input_dim - input_dim // 2, output_dim - output_dim // 2, kernel_size, stride, bias=True)

    def forward(self, x):
        # split the input channels into two halves
        x1, x2 = x.split([x.size(1) // 2, x.size(1) - x.size(1) // 2], dim=1)
        # wrap each half in DataParallel and place it on its own GPU
        self.conv1 = nn.DataParallel(self.conv1, device_ids=[0]).to('cuda:0')
        self.conv2 = nn.DataParallel(self.conv2, device_ids=[1]).to('cuda:1')
        x1_out = self.conv1(x1)
        x2_out = self.conv2(x2)
        # bring the second half back to GPU 0 and concatenate
        x2_out = x2_out.to('cuda:0')
        x = torch.cat([x1_out, x2_out], dim=1)
        return x

Maybe you should check whether there is free memory on the GPUs you are using.
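For example, a minimal sketch that prints what PyTorch itself holds on each visible device (on recent versions; older releases call the second counter memory_cached). nvidia-smi will additionally show memory held by other processes:

import torch

# print per-device memory that PyTorch has allocated and reserved
for i in range(torch.cuda.device_count()):
    alloc = torch.cuda.memory_allocated(i) / 1024 ** 2
    reserved = torch.cuda.memory_reserved(i) / 1024 ** 2
    print(f"cuda:{i}: allocated {alloc:.0f} MB, reserved {reserved:.0f} MB")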

Thank you for your reply. My GPU has 16000 MB of memory, but it still cannot run. If I run everything on a single GPU, it only needs about 6000 MB. Could it be that nn.DataParallel has a problem during backpropagation?

At every iteration you are wrapping your convs in another DataParallel inside forward, so you end up with 2^num_iter nested splits…
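In other words, the wrapping and the .to(device) calls should happen once, when the module is built, not on every forward pass. Here is a minimal sketch of that restructuring, assuming the same two devices 'cuda:0' and 'cuda:1' as in your snippet; nn.DataParallel is not needed when each branch runs on a single GPU:

import torch
import torch.nn as nn

class Conv2d_Multi(nn.Module):
    def __init__(self, input_dim, output_dim, kernel_size, stride, bias=True):
        super(Conv2d_Multi, self).__init__()
        # build and place each half-convolution once, in __init__, not in forward()
        self.conv1 = nn.Conv2d(input_dim // 2, output_dim // 2,
                               kernel_size, stride, bias=bias).to('cuda:0')
        self.conv2 = nn.Conv2d(input_dim - input_dim // 2, output_dim - output_dim // 2,
                               kernel_size, stride, bias=bias).to('cuda:1')

    def forward(self, x):
        # split the channels and send each half to its own GPU
        x1, x2 = x.split([x.size(1) // 2, x.size(1) - x.size(1) // 2], dim=1)
        x1_out = self.conv1(x1.to('cuda:0'))
        x2_out = self.conv2(x2.to('cuda:1'))
        # gather both outputs on GPU 0 before concatenating
        return torch.cat([x1_out, x2_out.to('cuda:0')], dim=1)

This way the module tree stays fixed across iterations, autograd routes the gradient of x2_out back to cuda:1 on its own, and memory usage should stay flat instead of growing every step.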

Thanks for all the replies!
I finally solved the problem…
