GPU memory leakage during model.eval() step

sunsunsun · May 10, 2020, 8:57pm

Hello! I can't figure out how to clear GPU memory or which objects are stored there. Code sample below. I added comments showing the memory usage of my 2 GPUs after every line of code. As you can see, del on the objects plus torch.cuda.empty_cache() works well in the training part (not perfectly, because about 0.5 GB more stays in use than before), but in the evaluation part of my training loop it fails.

My main questions:

Why do I get 1.2 + 0.6 GB after the train part vs 0.7 + 0.0 GB before training?
Why does empty_cache() completely fail after the eval part?

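import torch
import torchvision

# fcn_resnet50 lives in torchvision.models.segmentation, not at the top level of torchvision
model = torchvision.models.segmentation.fcn_resnet50(pretrained=False, progress=False, num_classes=12)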
model = torch.nn.DataParallel(model)

### GPU USAGE: 0.0 and 0.0 gb
model.to('cuda:0')
### GPU USAGE 0.7 and 0.0 gb
criterion = torch.nn.CrossEntropyLoss()
### GPU USAGE 0.7 and 0.0 gb
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
### GPU USAGE 0.7 and 0.0 gb
for epoch in [0]:
    torch.cuda.empty_cache()
    ### GPU USAGE 0.7 and 0.0 gb
    model.train()
    ### GPU USAGE 0.7 and 0.0 gb
    if 1 == 1:
        img, mask = next(iter(loaders['train']))
        ### GPU USAGE 0.7 and 0.0 gb
        img, mask = img.to('cuda:0'), mask.to('cuda:0')
        ### GPU USAGE 0.8 and 0.0 gb
        predicted_mask = model(img)['out']
        ### GPU USAGE 4.8 and 4.6 gb
        loss = criterion(predicted_mask, mask.long())
        ### GPU USAGE 4.8 and 4.6 gb
        optimizer.zero_grad()
        ### GPU USAGE 4.8 and 4.6 gb
        loss.backward()
        ### GPU USAGE 5.4 and 5.0 gb
        optimizer.step()
        ### GPU USAGE 5.4 and 5.0 gb

        del img, mask, predicted_mask
        ### GPU USAGE 5.4 and 5.0 gb
        torch.cuda.empty_cache()
        ### GPU USAGE 1.2 and 0.6 gb

    # start validation part
    model.eval()
    ### GPU USAGE 1.2 and 0.6 gb
    if 1 == 1:
        img, mask = next(iter(loaders['train']))
        ### GPU USAGE 1.2 and 0.6 gb
        img, mask = img.to('cuda:0'), mask.to('cuda:0')
        ### GPU USAGE 1.2 and 0.6 gb
        predicted_mask = model(img)['out']
        ### GPU USAGE 5.1 and 4.6 gb
        loss = criterion(predicted_mask, mask.long())
        ### GPU USAGE 5.2 and 4.6 gb
        del img, mask, predicted_mask
        ### GPU USAGE 5.2 and 4.6 gb
        torch.cuda.empty_cache()
        ### GPU USAGE 5.2 and 4.6 gb

ptrblck · May 11, 2020, 2:34am

Try to wrap your validation loop in with torch.no_grad(), as this will avoid storing the intermediate tensors that are required to calculate the gradients during the backward pass.
Currently you are neither deleting loss, which holds a reference to the computation graph and thus all intermediates, nor calling loss.backward(), which would clear the intermediate tensors (not calling it is correct for a validation loop).
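The suggestion above can be sketched roughly as follows. This is a minimal illustration, not the original poster's code: the tiny Linear model and the random img and mask tensors are hypothetical stand-ins so that the snippet runs on CPU without the FCN model or data loaders.

```python
import torch

# Hypothetical stand-ins for the thread's model, criterion, and batch.
model = torch.nn.Linear(4, 3)
criterion = torch.nn.CrossEntropyLoss()
img = torch.randn(8, 4)
mask = torch.randint(0, 3, (8,))

model.eval()
with torch.no_grad():
    # Inside no_grad, autograd does not record the forward pass,
    # so no intermediate activations are kept for a backward pass.
    predicted_mask = model(img)
    loss = criterion(predicted_mask, mask)

# Keep only a plain Python float for logging.
val_loss = loss.item()
```

Because nothing is recorded inside the context manager, loss here has no grad_fn and does not keep any intermediates alive.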

sunsunsun · May 11, 2020, 11:39am

Could you please explain, or point me to some reference about, loss.backward() during the validation loop? Or did you mean that loss.backward() would help me, but that it's incorrect for the validation part?

sunsunsun · May 11, 2020, 11:43am

Anyway, adding the torch.no_grad() context manager fixed my memory issue, thanks a lot!

futscdav · May 11, 2020, 11:56am

What he meant is that calling loss.backward() would free the computation graph, but it's correct not to do that during validation. The reference to the loss object keeps the whole graph alive, because backward could still be called.
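A small sketch of that point, again using a hypothetical tiny model and random tensors rather than the code from the question: a loss computed with gradient tracking carries a grad_fn that links back to the graph, while a detached copy or a plain Python number does not.

```python
import torch

# Hypothetical stand-ins for the thread's model, criterion, and batch.
model = torch.nn.Linear(4, 3)
criterion = torch.nn.CrossEntropyLoss()
img = torch.randn(8, 4)
mask = torch.randint(0, 3, (8,))

# Computed with gradient tracking: the loss references the graph via grad_fn.
loss_with_graph = criterion(model(img), mask)
has_graph = loss_with_graph.grad_fn is not None

# A detached copy shares the value but carries no link to the graph.
detached = loss_with_graph.detach()
detached_has_graph = detached.grad_fn is not None

# Store the number, then drop the tensor so the graph can be released.
loss_value = loss_with_graph.item()
del loss_with_graph
```

If the loss is only needed for logging, keeping loss.item() (or a detached tensor) instead of the loss tensor itself avoids holding on to the graph.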