It seems like I am getting some extra memory overhead with dropout. Here is a toy code to illustrate the problem.
import torch.nn as nn
import torch.nn.functional as F
import torch
import gc
from py3nvml.py3nvml import *
nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)
def print_free_memory(point):
    info = nvmlDeviceGetMemoryInfo(handle)
    print("Used memory: {:10.4f}GB at point {}".format(info.used/(1024**3), point))
class Test(nn.Module):
    def __init__(self):
        super(Test, self).__init__()

    def forward(self, x):
        print_free_memory("before dropout")
        output = F.dropout(x, training=True)
        print(output.shape)
        print_free_memory("after dropout")
        return output
model = Test().cuda()
def run():
    device = torch.device('cuda')
    for i in range(1, 2):
        x = torch.rand(30, 175, 4096).to(device)
        out = model(x)

run()
For this run, the output is:
Used memory: 0.7822GB at point before dropout
torch.Size([30, 175, 4096])
Used memory: 1.2705GB at point after dropout
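One thing worth keeping in mind when reading these numbers: nvmlDeviceGetMemoryInfo reports driver-level usage, which includes the CUDA context and everything PyTorch's caching allocator has reserved, not just the bytes held by live tensors. A minimal sketch of the comparison, assuming a reasonably recent PyTorch (memory_reserved was called memory_cached in older releases):

```python
import torch

# Driver-level usage (what py3nvml reports) can exceed the sum of live
# tensor sizes considerably: it also counts the CUDA context and the
# caching allocator's reserved-but-unused pools.
if torch.cuda.is_available():
    x = torch.rand(30, 175, 4096, device="cuda")
    print(torch.cuda.memory_allocated() / 1024**2, "MB held by tensors")
    print(torch.cuda.memory_reserved() / 1024**2, "MB reserved from the driver")
```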
AFAIK x will occupy (30 * 175 * 4096 * 32) / (8 * 1024 * 1024) = 82MB of memory, and there is an x.clone() in dropout, so in total it should occupy 82 * 2 = 164MB. But as we can see, the difference here is roughly 490MB. Although the difference is not very large here, in my case, where I stack multiple layers each with dropout enabled, it makes the model go out of memory.
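The arithmetic above can be double-checked in a couple of lines (a float32 element is 32 bits, i.e. 4 bytes):

```python
# Size of one float32 tensor of shape (30, 175, 4096).
elements = 30 * 175 * 4096
size_mb = elements * 4 / (1024 ** 2)              # 4 bytes per element
print(f"one tensor:     {size_mb:.0f} MB")        # ~82 MB
print(f"input + output: {2 * size_mb:.0f} MB")    # ~164 MB
# The observed jump (~490 MB) is roughly three times this estimate,
# which is the discrepancy being asked about.
```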
UPDATE:
If I use inplace=True, there is a slight reduction in used memory after dropout (from 1.2705GB to 1.1885GB), which is exactly equal to the memory occupied by the output variable.
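A minimal illustration of that behavior (on CPU for simplicity): with inplace=True, F.dropout writes the result back into the input's storage, so the separate output tensor from the run above is never allocated.

```python
import torch
import torch.nn.functional as F

x = torch.rand(30, 175, 4096)
out = F.dropout(x, p=0.5, training=True, inplace=True)

# The returned tensor shares storage with x: no second 82 MB buffer.
print(out.data_ptr() == x.data_ptr())  # True
```

Note that inplace=True destroys the original values of x, so it is only safe when nothing else in the graph still needs the pre-dropout activations.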