Number of allocated tensors starts to grow if 3D grid size is above (110, 110, 110)

Hi,

I am currently working on 3D deep learning with 3D convolutions on grids. I have encountered the problem that if the 3D grid size is above (110, 110, 110), the number of allocated tensors on the GPU starts to grow when calling loss.backward(), causing a GPU memory leak.

Why is that the case?

An example:

    x1 = self.padding(x)              # ReplicationPad3d(1) for the 3x3x3 convolution
    x1 = self.kernel_1(x1)
    x1 = self.tanh(x1)

    x1 = torch.cat((x, x1), dim=1)    # concatenate input and features: 16 + 16 = 32 channels

    x1 = self.kernel_2(x1)
    x1 = self.tanh(x1)

    x1 = self.kernel_3(x1)
    x1 = self.tanh(x1)

    x1 = self.kernel_4(x1)
    x1 = self.tanh(x1)

    x1 = self.kernel_5(x1)
    x1 = self.tanh(x1)

    return x1

If I return x1 after kernel_3, the problem does not happen; if I return x1 after kernel_4 or later, it does. Apart from kernel_1 (a 3x3x3 convolution), all kernels are 3D convolutions with 1x1x1 kernels.
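
For reference, this is roughly how I count the allocated CUDA tensors around the backward call (a sketch that walks the Python garbage collector; the helper name is my own):

    import gc
    import torch

    def count_cuda_tensors():
        # Count live CUDA tensors currently tracked by the garbage collector
        count = 0
        for obj in gc.get_objects():
            try:
                if torch.is_tensor(obj) and obj.is_cuda:
                    count += 1
            except Exception:
                continue
        return count

    # Inside the training loop:
    # before = count_cuda_tensors()
    # loss.backward()
    # print('tensors before/after backward:', before, count_cuda_tensors())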

Thanks in advance!

Are you passing an input of [batch_size, channels, 110, 110, 110] to nn.Conv3d layers or do I misunderstand your question regarding the grid?
Could you post the model definition so that we can reproduce this issue, please?

Hi @ptrblck,

Thanks for the fast reply!

Yes, exactly. So the problem of new tensors being allocated starts when increasing the input size from [1, 16, 110, 110, 110] to e.g. [1, 16, 115, 115, 115].
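
For scale, a quick back-of-the-envelope check of the raw float32 input size at both resolutions (just the input tensor; the intermediate activations scale the same way):

    # Approximate size of a [1, 16, d, d, d] float32 tensor in MiB
    for d in (110, 115):
        numel = 1 * 16 * d ** 3
        print(d, round(numel * 4 / 1024 ** 2, 1))  # ~81.2 MiB vs ~92.8 MiB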

The model definition is:

    self.kernel_1 = torch.nn.Conv3d(16, 16,
                                    kernel_size=3,
                                    stride=1,
                                    padding=0)

    self.kernel_2 = torch.nn.Conv3d(32, 16,
                                    kernel_size=1,
                                    stride=1,
                                    padding=0)

    self.kernel_3 = torch.nn.Conv3d(16, 8,
                                    kernel_size=1,
                                    stride=1,
                                    padding=0)

    self.kernel_4 = torch.nn.Conv3d(8, 4,
                                    kernel_size=1,
                                    stride=1,
                                    padding=0)

    self.kernel_5 = torch.nn.Conv3d(4, 1,
                                    kernel_size=1,
                                    stride=1,
                                    padding=0)

    self.padding = torch.nn.ReplicationPad3d(1)

    self.relu = torch.nn.LeakyReLU()
    self.tanh = torch.nn.Tanh()
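
In case it helps with reproducing, this is the module assembled into a class (a sketch; the layers and the forward pass are the ones posted above, wrapped in a standard nn.Module):

    import torch

    class MyModel(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.kernel_1 = torch.nn.Conv3d(16, 16, kernel_size=3, stride=1, padding=0)
            self.kernel_2 = torch.nn.Conv3d(32, 16, kernel_size=1, stride=1, padding=0)
            self.kernel_3 = torch.nn.Conv3d(16, 8, kernel_size=1, stride=1, padding=0)
            self.kernel_4 = torch.nn.Conv3d(8, 4, kernel_size=1, stride=1, padding=0)
            self.kernel_5 = torch.nn.Conv3d(4, 1, kernel_size=1, stride=1, padding=0)
            self.padding = torch.nn.ReplicationPad3d(1)
            self.relu = torch.nn.LeakyReLU()
            self.tanh = torch.nn.Tanh()

        def forward(self, x):
            x1 = self.tanh(self.kernel_1(self.padding(x)))
            x1 = torch.cat((x, x1), dim=1)          # 16 + 16 = 32 channels
            x1 = self.tanh(self.kernel_2(x1))
            x1 = self.tanh(self.kernel_3(x1))
            x1 = self.tanh(self.kernel_4(x1))
            x1 = self.tanh(self.kernel_5(x1))
            return x1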

Thanks!

Thanks for the code!
I cannot reproduce a memory leak with a master build + CUDA 10.2 or with the 1.4.0 binaries + CUDA 10.1.

I’m using your model and this code:

    model = MyModel().cuda()
    x = torch.randn(1, 16, 115, 115, 115).cuda()
    torch.cuda.synchronize()
    print('mem allocated ', torch.cuda.memory_allocated()/1024**3)

    for idx in range(100):
        output = model(x)
        print('iter {} mem allocated {}'.format(
            idx, torch.cuda.memory_allocated()/1024**3))
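
Since your report mentions loss.backward(), you could also add a backward pass to the loop and watch the memory there (a sketch with a dummy scalar loss standing in for your criterion):

    model = MyModel().cuda()
    x = torch.randn(1, 16, 115, 115, 115).cuda()

    for idx in range(100):
        output = model(x)
        loss = output.pow(2).mean()   # dummy loss; use your real criterion/target here
        loss.backward()
        model.zero_grad()             # clear the accumulated gradients between iterations
        print('iter {} mem allocated {:.3f} GB'.format(
            idx, torch.cuda.memory_allocated() / 1024**3))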

Do you see a memory leak using my test code?

PS: I tested the volumetric shapes of 110, 115, and 125.