Dropout eating up a lot of memory

It seems like I am getting some extra memory overhead with dropout. Here is a toy example to illustrate the problem.

import torch.nn as nn
import torch.nn.functional as F
import torch
import gc

from py3nvml.py3nvml import *
nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)
def print_free_memory(point):
  # report total GPU memory currently in use (as seen by NVML) at a given point
  info = nvmlDeviceGetMemoryInfo(handle)
  print("Used memory: {:10.4f}GB at point {}".format(info.used/(1024**3), point))

class Test(nn.Module):
    def __init__(self):
        super(Test, self).__init__()
        pass
    def forward(self, x):
        print_free_memory("before dropout")
        output = F.dropout(x, training=True)
        print(output.shape)
        print_free_memory("after dropout")
        return output

model = Test().cuda()

def run():
  device = torch.device('cuda')
  for i in range(1,2):
    x = torch.rand(30, 175, 4096).to(device)
    out = model(x)

run()

For this run, the output is:

Used memory:     0.7822GB at point before dropout
torch.Size([30, 175, 4096])
Used memory:     1.2705GB at point after dropout

AFAIK x will occupy (30*175*4096*32) / (8*1024*1024) = 82MB of memory, and there is an x.clone() in dropout, so in total it should occupy 82*2 = 164MB. But as we can see, the difference here is roughly 490MB. Although the difference is not very high here, in my case, where I stack multiple layers with dropout enabled in each of them, the model goes out of memory.
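As a quick sanity check, the 82MB figure follows directly from the tensor shape (float32 is 4 bytes per element):

elements = 30 * 175 * 4096            # 21,504,000 elements
size_mb = elements * 4 / (1024 ** 2)  # 4 bytes per float32 element
print(size_mb)                        # ~82.03 MB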

UPDATE:
If I use inplace=True then there is a slight reduction in used memory after dropout (from 1.2705GB to 1.1885GB), which is exactly equal to the memory occupied by the output variable.
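For reference, the in-place variant is just an extra keyword on the same call (this sketch mutates x directly, so x should not be needed afterwards):

output = F.dropout(x, training=True, inplace=True)  # reuses x's storage instead of allocating a new output tensor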

I think the tensor stays in global memory when training=True; try it without this flag and let's see.

If I set training=False then dropout doesn't do anything. That line simply returns the original tensor when the flag is not enabled.
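A quick way to verify that behaviour (assuming the default p=0.5):

x = torch.rand(4, 4)
out = F.dropout(x, training=False)
print(torch.equal(out, x))  # True: nothing is zeroed or rescaled when training=False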

Hey, sorry, since you are using the functional version, try this instead:

import torch.nn as nn
drop = nn.Dropout2d()

Then you can use drop like a function: drop(…)
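A minimal sketch of how that could look inside the Test module from the original post (using plain nn.Dropout here; nn.Dropout2d as suggested above would be used the same way):

import torch.nn as nn

class Test(nn.Module):
    def __init__(self):
        super(Test, self).__init__()
        # module version of dropout; p=0.5 by default, active whenever the module is in train() mode
        self.drop = nn.Dropout()

    def forward(self, x):
        return self.drop(x)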

By the way, which cuDNN version are you using?

I’ve tried your code with print(torch.cuda.memory_allocated()) instead of your nvml functions, since I’m not familiar with them.
It seems the code uses approx. 82MB:

import torch
import torch.nn as nn
import torch.nn.functional as F


class Test(nn.Module):
    def __init__(self):
        super(Test, self).__init__()

    def forward(self, x):
        print(torch.cuda.memory_allocated() / 1024**2)
        output = F.dropout(x, training=True)
        print(torch.cuda.memory_allocated() / 1024**2)
        return output


def run():
  for i in range(1,2):
    x = torch.rand(30, 175, 4096).to(device)
    out = model(x)


device = torch.device('cuda')
model = Test().to(device)

run()
> 165
> 247
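Note that memory_allocated() only counts memory held by live tensors, while the number reported by NVML also includes the CUDA context and whatever the caching allocator has reserved but not yet returned to the driver, which likely explains the ~490MB gap you saw. On reasonably recent PyTorch versions you can inspect the reserved pool as well:

print(torch.cuda.memory_allocated() / 1024**2)  # MB held by live tensors
print(torch.cuda.memory_reserved() / 1024**2)   # MB reserved by the caching allocator (>= allocated)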