Understanding CUDA out of memory error

Hello, I have the following code snippet, which gives me a 'CUDA out of memory' error after several passes of the for loop when the batch size (b) is 50 or above. With a batch size below 40, however, it seems to run fine. My question is: with the larger batch size (>= 50), why does the error only appear after several passes? Is some memory overhead accumulating inside the loop?

import torch
import torch.nn as nn
import torch.optim as optim

from resnet_3d import Video3DCNN  # custom model definition (presumably resnet_3d.py, per the warning below)

model = Video3DCNN('resnet34', 10, sample_size=224)
model.to('cuda:0')
optimizer1 = optim.Adam(model.parameters(), lr=0.0005)
loss = nn.CrossEntropyLoss()
b = 50
ep_loss = 0.
for i in range(100):
    torch.cuda.empty_cache()
    # Random clips of shape (batch, channels, frames, height, width) and random labels.
    logits = model(torch.randn(b, 3, 10, 224, 224).cuda())
    label = torch.randint(high=1000, size=(b,)).cuda()
    loss_ = loss(logits, label)
    optimizer1.zero_grad()
    print(i, loss_.item())
    print(logits.size())
    loss_.backward()
    optimizer1.step()
    ep_loss += loss_.detach().item()

Output for batch size b == 50:
/home/ahosain/workspace/cslr/resnet_3d.py:152: UserWarning: nn.init.kaiming_normal is now deprecated in favor of nn.init.kaiming_normal_.
m.weight = nn.init.kaiming_normal(m.weight, mode='fan_out')
0 6.941838264465332
torch.Size([50, 1000])
1 7.274449348449707
torch.Size([50, 1000])
2 7.0289177894592285
torch.Size([50, 1000])
3 7.262955188751221
torch.Size([50, 1000])
Traceback (most recent call last):
  File "test.py", line 38, in <module>
    loss_.backward()
  File "/home/ahosain/workspace/dl_env/lib/python3.6/site-packages/torch/tensor.py", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ahosain/workspace/dl_env/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 1.50 GiB (GPU 0; 10.76 GiB total capacity; 5.23 GiB already allocated; 1.48 GiB free; 3.19 GiB cached)

Could you plot the GPU memory usage over your script's running time? There's a torch.cuda function for that. (In particular, it would be good to also reset the value at each iteration, as described in the docs.)
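For example, something along these lines (just a sketch; newer PyTorch versions also provide torch.cuda.reset_peak_memory_stats() as the preferred way to reset the counter):

import torch

peak_mb_per_iter = []
for i in range(100):
    # ... your forward / backward / optimizer step here ...
    # Peak memory allocated by tensors on GPU 0 since the last reset, in MB.
    peak_mb_per_iter.append(torch.cuda.max_memory_allocated('cuda:0') / 1024**2)
    # Reset the peak counter so the next iteration is measured on its own.
    torch.cuda.reset_max_memory_allocated('cuda:0')
print(peak_mb_per_iter)  # plot this (e.g. with matplotlib) to see whether it grows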

It's weird, though, that it crashes at step 3-4 in your batch-size-50 run and not at all in your batch-size-40 run…

Hi, thanks for the reply and the function reference. I used it and found something confusing. I have two RTX 2080 Ti GPUs. In the outputs below, every third number shows the cached memory at that point, as reported by torch.cuda.max_memory_cached().
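(A sketch of the corresponding logging line inside the loop; the exact MB conversion is assumed:)

print(torch.cuda.max_memory_cached('cuda:0') / 1024**2)  # peak cached memory in MB; max_memory_reserved() on newer PyTorch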

When I use GPU 0, I get the following output:

0 6.981476306915283
torch.Size([50, 1000])
10034.0
1 6.936072826385498
torch.Size([50, 1000])
10034.0
2 7.297829627990723
torch.Size([50, 1000])
10034.0
3 7.23666524887085
torch.Size([50, 1000])
Traceback (most recent call last):
  File "test.py", line 44, in <module>
    loss_.backward()
  File "/home/ahosain/workspace/dl_env/lib/python3.6/site-packages/torch/tensor.py", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ahosain/workspace/dl_env/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 1.50 GiB (GPU 0; 10.76 GiB total capacity; 5.23 GiB already allocated; 1.48 GiB free; 3.19 GiB cached)

On GPU 1, it seems to run fine, with the following output:

0 7.1220622062683105
torch.Size([50, 1000])
8964.0
1 7.066858291625977
torch.Size([50, 1000])
9512.0
2 7.112499237060547
torch.Size([50, 1000])
9512.0
3 7.206641674041748
torch.Size([50, 1000])
9536.0
4 7.3534345626831055
torch.Size([50, 1000])
9536.0
5 7.341899871826172
torch.Size([50, 1000])
9536.0
6 7.346314907073975
torch.Size([50, 1000])
9536.0
7 7.240469932556152
torch.Size([50, 1000])
9536.0
...

:thinking: That's odd: you don't even use DataParallel in your code sample, and you empty the cache at each iteration…

I noticed something in your original code: instead of calling .cuda() on your input and label tensors, try calling .to('cuda:0'), the same as for your model. I have no idea whether that would change anything, since it would probably already fail if the tensors were on a different GPU than the model, but it's worth trying.
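I.e., something like this (just a sketch of the relevant lines from your snippet):

device = torch.device('cuda:0')
inputs = torch.randn(b, 3, 10, 224, 224).to(device)
label = torch.randint(high=1000, size=(b,)).to(device)
logits = model(inputs)
loss_ = loss(logits, label)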
