How to release GPU memory for a sequence of predictions?

I am trying to use a pretrained model to generate a sequence of predictions for a long video. The model should remain frozen during inference.
The pseudo code is like this:

model = torch.load("pretrained_model.pyth")

for one_video in MyDataloader:
    all_my_pred_list = []
    for vid_seg in one_video:
        pred = model(vid_seg)
        all_my_pred_list.append(pred)
        del vid_seg

    # Other routine that consumes all_my_pred_list, for example:
    # video_level_pred = average(all_my_pred_list)

    for ele in all_my_pred_list:
        del ele

The problem with the above code is that, when the video is very long, it runs out of GPU memory.
What is the best way to release GPU memory during this inference without destroying the loaded model?

Wrap the forward pass in a with torch.no_grad() or with torch.inference_mode() guard, which avoids storing the intermediate forward activations that would otherwise be needed for the gradient computation.
Currently you are not only storing pred but also the entire computation graph, including all intermediate tensors, unless you have disabled gradient calculation globally.
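A minimal sketch of the suggested fix, using a toy nn.Linear as a stand-in for the pretrained model and random tensors as stand-ins for video segments (both are assumptions for illustration). Predictions are also moved to the CPU inside the loop so the accumulating list does not hold GPU memory:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-in for the frozen pretrained model.
model = torch.nn.Linear(128, 10).to(device).eval()

preds = []
with torch.inference_mode():  # no autograd graph is built for these forwards
    for _ in range(100):
        # Stand-in for one video segment from the dataloader.
        vid_seg = torch.randn(8, 128, device=device)
        pred = model(vid_seg)
        # Move the prediction off the GPU so the list does not grow GPU memory.
        preds.append(pred.cpu())

# Video-level aggregation happens on the CPU copies.
video_level_pred = torch.stack(preds).mean(dim=0)
```

With the guard active, each pred carries no graph, so the only per-iteration GPU cost is the forward pass itself; appending the CPU copy keeps the list cheap regardless of video length.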


Hi @ptrblck, thank you so much for your response. Yes, I forgot to include model.eval() in the above exemplary code. My situation is that, even with model.eval(), GPU memory usage is still high because I am running inference multiple times (a sequence of predictions generated for one single input video). So what can I do to further release GPU memory? I imagine I should "free something" inside the innermost for-loop: I am currently doing "del vid_seg", but I am not sure how to safely release the temporary memory introduced by the model.

model.eval() won’t reduce the memory usage but will change the behavior of some modules: e.g. dropout will be disabled and batchnorm layers will use their running stats to normalize the input activations.
Use the already mentioned guards instead.
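A quick way to see the difference: model.eval() alone does not stop graph construction, while the guard does. A minimal check, assuming a toy nn.Linear as a stand-in for any model:

```python
import torch

model = torch.nn.Linear(4, 2)
model.eval()  # only changes dropout/batchnorm behavior; autograd is unaffected

x = torch.randn(1, 4)

out = model(x)        # still attached to a graph: out.requires_grad is True
with torch.no_grad():
    out_ng = model(x)  # detached: out_ng.requires_grad is False

print(out.requires_grad, out_ng.requires_grad)
```

Only the guarded output is free of the computation graph, which is what actually releases the intermediate activations during inference.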
