Run out of GPU memory


I get CUDA runtime error (2) during training. Strangely, the first iteration runs fine, but the second raises the runtime error. My model has 32 million parameters, and I have two NVIDIA 1080 Ti cards, i.e. 22 GB of GPU memory in total. I’m feeding 256x256x3 images with a batch size of 40.

Here’s the code:

for epoch in range(1, epochs):
    train_loss = 0.0
    for i, batch in enumerate(loader):
        ims, masks = batch['image'].cuda(), batch['mask'].cuda().float()

        ims = Variable(ims)
        masks = Variable(masks)


        outputs = net(ims)

        loss = criterion(outputs, masks)

        train_loss +=[0]

I tried adding torch.cuda.empty_cache() and that didn’t help.

Any advice on how to fix this? Why does it run the 1st iteration and run out of GPU memory for the 2nd iteration?
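
For scale, here is a rough back-of-the-envelope estimate of the numbers quoted above. It assumes float32 storage (4 bytes per value), which the post does not state, and `tensor_mib` is a hypothetical helper written for this sketch. It only covers the weights and one input batch. The intermediate activations kept for the backward pass depend on the architecture, which isn't shown, and are usually the much larger share.

```python
BYTES_PER_FLOAT32 = 4          # assumption: everything is stored as float32
MIB = 1024 ** 2

def tensor_mib(*shape):
    """Size in MiB of a float32 tensor with the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n * BYTES_PER_FLOAT32 / MIB

param_mib = tensor_mib(32_000_000)        # 32 million weights
batch_mib = tensor_mib(40, 3, 256, 256)   # one batch of 256x256x3 images

print(f"parameters:  {param_mib:.1f} MiB")
print(f"input batch: {batch_mib:.1f} MiB")
```

Both figures are small next to an 11 GB card, so the weights and inputs alone are unlikely to explain the error.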

Have you tried reducing the batch size? I’d suggest doing that and then monitoring GPU memory usage to see whether it still fails at the second iteration and whether there’s an odd pattern. I haven’t seen this with PyTorch myself; I’m just trying to spur some ideas.
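
One way to do that monitoring from inside the training script is sketched below. It assumes a PyTorch version that provides `torch.cuda.memory_allocated` and `torch.cuda.max_memory_allocated`. `gpu_memory_mib` is a hypothetical helper name, and the loop body is a placeholder rather than the poster's code.

```python
import torch

def gpu_memory_mib(device=0):
    """Return (allocated, peak) tensor memory in MiB, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    return (torch.cuda.memory_allocated(device) / mib,
            torch.cuda.max_memory_allocated(device) / mib)

for step in range(3):
    # ... forward pass, backward pass and optimizer step would go here ...
    print(f"step {step}: {gpu_memory_mib()}")
```

If the allocated figure keeps climbing from one step to the next, something is holding on to tensors between iterations.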

I did, but I had to reduce the batch size to an unreasonably small number, around 11 per GPU. I suspect that storing the result of the loss function is taking most of the memory. I’ll look into it further; for now I’ve reduced the size of the model by a factor of 4.
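
On the suspicion about storing the loss: adding the loss tensor itself to a running total keeps that total attached to the autograd graph, whereas adding a plain Python number does not. The thread does not confirm that this is the cause here. The sketch below shows the second pattern and assumes PyTorch 0.4 or later, where `.item()` is available. The tiny `nn.Linear` model and random data are hypothetical stand-ins so that it runs on CPU.

```python
import torch
import torch.nn as nn

net = nn.Linear(8, 1)            # hypothetical stand-in for the real model
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

train_loss = 0.0
for _ in range(3):
    ims = torch.randn(4, 8)      # dummy batch in place of batch['image']
    masks = torch.randn(4, 1)    # dummy targets in place of batch['mask']

    optimizer.zero_grad()
    loss = criterion(net(ims), masks)
    loss.backward()
    optimizer.step()

    # .item() copies the value out as a plain Python float, so the running
    # total holds no reference to this iteration's autograd graph.
    train_loss += loss.item()

print(type(train_loss), train_loss)
```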

I set my variables to volatile and that worked for me. My guess is that during the first iteration the model allocates memory for some of its variables and does not release it, so on the second iteration the GPU runs out of memory because of what is still occupied.

Here’s the related code:

    ims = Variable(ims, volatile=True)
    masks = Variable(masks, volatile=True)

This way, the training set does not stay in GPU memory the whole time.

Volatile variables do not require gradient computation by default (check .requires_grad). The volatile property also propagates to every variable computed from them further along the computational graph. Could you check the other variables’ requires_grad property to make sure that training is actually happening?
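
A quick way to run that check is sketched below. The `volatile` flag was removed in PyTorch 0.4, so the sketch uses `torch.no_grad()`, its replacement in later versions, to get the same effect. The small `nn.Linear` model is a hypothetical stand-in, not the network from the question.

```python
import torch
import torch.nn as nn

net = nn.Linear(8, 1)        # hypothetical stand-in for the real model
ims = torch.randn(4, 8)      # dummy input batch

# Forward pass with gradient tracking switched off (the modern
# counterpart of wrapping the inputs in volatile Variables).
with torch.no_grad():
    frozen = net(ims)

# Ordinary forward pass, as needed for training.
tracked = net(ims)

print("no_grad output:", frozen.requires_grad, frozen.grad_fn)
print("normal output: ", tracked.requires_grad, tracked.grad_fn is not None)
```

An output with `requires_grad` set to False and no `grad_fn` cannot be backpropagated through. That would explain the lower memory use, but it would also mean the weights are not being updated.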