How to more aggressively release GPU memory?

Hi, I am using the latest PyTorch master branch, and I am iteratively sending images to a neural net and saving the output predictions to a video file.

Due to the iterative nature, after a few runs my puny GPU runs out of memory, so I tried adding this to each iteration to free caches:

torch.cuda.empty_cache()

But that had no effect. The only thing that works is recreating the model in every loop iteration, which is very slow:

while True:
    model = ENet(num_classes).to(device)
    optimizer = optim.Adam(model.parameters())
    model = utils.load_checkpoint(model, optimizer, model_path, model_name)[0]
    model.eval()

    # unrelated code then pulls camera images as my "input"
    with torch.no_grad():
        predictions = model(input)

    # Predictions is one-hot encoded with "num_classes" channels.
    # Convert it to a single int using the indices where the maximum (1) occurs
    _, predictions = torch.max(predictions.data, 1)

    label_to_rgb = transforms.Compose([
        ext_transforms.LongTensorToRGBPIL(class_encoding),
        transforms.ToTensor()
    ])
    color_predictions = utils.batch_transform(predictions.cpu(), label_to_rgb)

    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 7))
    ax1.imshow(np.transpose(torchvision.utils.make_grid(input.data.cpu()).numpy(), (1, 2, 0)))
    ax2.imshow(np.transpose(torchvision.utils.make_grid(color_predictions).numpy(), (1, 2, 0)))

So how can I aggressively drop all consumed memory without having to recreate the model from scratch? Could this be indicative of a memory leak?

Try running gc.collect() before torch.cuda.empty_cache(); that should free up more memory.

Thank you, but unfortunately I still run out of GPU memory after 10 or 20 iterations.

Try adding del predictions (or deleting any other variable you no longer need that is stored on the GPU) before calling torch.cuda.empty_cache().
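
In other words, drop the Python references first, then collect, then let the caching allocator release its blocks. A rough sketch of what I mean, with the model built once outside the loop (ENet and get_camera_frame here are just stand-ins for your own setup):

    import gc

    import torch

    # build the model once, outside the loop
    model = ENet(num_classes).to(device)       # stand-in for your model/checkpoint setup
    model.eval()

    while True:
        input = get_camera_frame()             # stand-in for however you pull camera frames
        with torch.no_grad():
            predictions = model(input)
        _, predictions = torch.max(predictions, 1)

        # ... colorize, plot, write to the video file ...

        # release references first, then collect, then free the cached blocks
        del predictions, input
        gc.collect()
        torch.cuda.empty_cache()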

Hi, and thank you, but that doesn't solve it. Why does recreating the model fix the issue? Could there be some state or cache the model is holding onto?

TBH, I would expect loading the model iteratively to be worse, since you are creating multiple models, and that should crash your GPU…
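
If you want to check whether allocations are really accumulating, you could also log the allocator statistics every iteration and watch which number climbs; a minimal sketch:

    import torch

    def log_cuda_memory(tag=""):
        # tensors currently referenced vs. blocks held by the caching allocator
        allocated = torch.cuda.memory_allocated() / 1024 ** 2
        reserved = torch.cuda.memory_reserved() / 1024 ** 2   # memory_cached() on older releases
        print(f"{tag}: allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")

If allocated keeps growing, something is still holding tensor references across iterations; if only reserved grows, that is just the caching allocator, which empty_cache() can release.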

Thanks, I’ll try to file a bug.

I’m having similar issues with attribution: I can’t uncomment the calls below without running out of memory, even with gc:

    # following the tutorial on attribution for semantic segmentation
    def save_attribution(label, target):
        lc_attr = layer_cond.attribute(input, target=target, n_steps=5, internal_batch_size=1)
        fig, ax = viz.visualize_image_attr_multiple(
            lc_attr[0].cpu().permute(1, 2, 0).detach().numpy(),
            original_image=orig,
            signs=["positive", "negative"],
            methods=["blended_heat_map", "blended_heat_map"],
            use_pyplot=False
        )
        # with use_pyplot=False the figure is not registered with pyplot,
        # so save it through the Figure object instead of plt.savefig
        fig.savefig(f'plot_{label}.png', bbox_inches='tight')
        del lc_attr, fig, ax
        gc.collect()
        torch.cuda.empty_cache()


    save_attribution('smooth', 1)
    # save_attribution('grass', 2)
    # save_attribution('rough', 3)

Sure, try creating a minimal implementation: no logging, no printing, etc. Also, could you share your original implementation?

I was able to resolve the attribution OOM exception by recreating the model before each call to attribute.
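
Roughly this shape, in case it helps anyone else (just a sketch, not my exact code: ENet, utils.load_checkpoint, and the checkpoint names come from earlier in the thread, some_layer is a placeholder for whichever layer you attribute to, and the example builds layer_cond as a Captum LayerConductance):

    import gc

    import torch
    from captum.attr import LayerConductance

    def attribute_with_fresh_model(input, target, n_steps=5):
        # rebuild the model and the attribution object for every call
        model = ENet(num_classes).to(device)
        model = utils.load_checkpoint(model, optim.Adam(model.parameters()),
                                      model_path, model_name)[0]
        model.eval()
        layer_cond = LayerConductance(model, model.some_layer)   # some_layer is a placeholder

        attr = layer_cond.attribute(input, target=target,
                                    n_steps=n_steps, internal_batch_size=1)
        attr = attr.detach().cpu()

        # drop everything holding GPU memory before emptying the cache
        del layer_cond, model
        gc.collect()
        torch.cuda.empty_cache()
        return attr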

So this definitely feels like a leak to me; I’ll try to summarize it in an issue soon.