CUDA out of memory when using model to make predictions

Hello, I am trying to use a trained model to make predictions (batch size of 10) on a test dataset, but my GPU quickly runs out of memory. I think it's because some unneeded variables/tensors are being held on the GPU, but I am not sure how to free them. I was able to find some forum posts about freeing the total GPU cache, but nothing about how to free the memory used by specific variables. My apologies if this has been asked and resolved before. How should I free the tensors after each iteration in my code below? I am storing the outputs in a list, so once the predictions are produced on each iteration they no longer need to stay on the GPU:


# Note: batches are 10 images, so stack 10 tensors before appending to the output list

model_outputs = []

# Step through the test set 10 examples at a time
for x in range(0, len(test_dataset) - 9, 10):
    # Stack 10 consecutive images into a batch and add a channel dimension
    batch = torch.stack([test_dataset[x + i][0].float().cuda()
                         for i in range(10)]).unsqueeze(1)
    outputs = model(batch)                 # forward pass over the batch
    _, predicted = torch.max(outputs, 1)   # predicted class per example
    for i in range(10):
        model_outputs.append((predicted[i].cpu(), test_dataset[x + i][2]))

RuntimeError: CUDA out of memory. Tried to allocate 80.00 MiB (GPU 0; 11.00 GiB total capacity; 4.22 GiB already allocated; 14.30 MiB free; 83.38 MiB cached)

You need to wrap inference in with torch.no_grad(): so that PyTorch does not store the intermediate information used for backprop, allowing that memory to be freed.


with torch.no_grad():
    output = model(input)

It may also be desirable to call model.eval().
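Putting the two together, an inference loop might look like the sketch below. The tiny Sequential model and random batch are placeholders standing in for your trained model and test data:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a trained model; substitute your own
model = nn.Sequential(nn.Flatten(), nn.Linear(16, 3))
model.eval()  # switch dropout/batchnorm layers to inference behavior

batch = torch.randn(10, 1, 4, 4)  # pretend batch of 10 single-channel images

with torch.no_grad():             # no autograd graph is built inside this block
    outputs = model(batch)
    _, predicted = torch.max(outputs, 1)

# Tensors produced inside no_grad carry no autograd history,
# so their intermediate activations can be freed immediately
print(outputs.requires_grad)  # False
```

Because nothing inside the block records gradient history, the activations from each batch are released as soon as the batch's outputs are moved off the GPU.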



Thanks for your help, it works now! So if I don't use torch.no_grad(), what extra information will be stored?

I have used model.eval() in a cell above when loading the model. What is the difference between with torch.no_grad() and model.eval()? What does model.eval() do?

no_grad tells PyTorch that you are not interested in saving gradients or the computational graph, since you are not going to backpropagate. If you don't use it, PyTorch keeps building the computational graph for every forward pass, assuming you will eventually backpropagate through all of them, and so it keeps using more and more memory.

model.eval() just turns off dropout and makes batch normalization use its fixed running statistics instead of per-batch statistics. You can do a quick search on the forum; both topics are widely discussed.
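The dropout half of that behavior is easy to see directly. A small sketch, using a standalone nn.Dropout layer rather than a full model:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

# In training mode, roughly half the activations are zeroed
# (and the survivors are scaled up by 1 / (1 - p) to compensate)
drop.train()
train_out = drop(x)

# In eval mode, dropout is a no-op: the input passes through unchanged
drop.eval()
eval_out = drop(x)

print(torch.equal(eval_out, x))       # True
print((train_out == 0).any().item())  # True: some units were dropped
```

Batch normalization behaves analogously: train() updates and uses per-batch statistics, while eval() freezes the stored running mean and variance.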


Thanks, I’ll do some more reading!