GPU memory doesn't get released

I’m trying to send my CNN model to the GPU device, but each time I run model = model.to(device) I get a “RuntimeError: CUDA error: out of memory” error.

I tried to use

import torch
torch.cuda.empty_cache()

but that did not work. I’ve also restarted the kernel, but that didn’t solve the problem either. I checked the free/used memory and it looks full; the image below shows the free/used memory.
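
A minimal sketch of how the free/used memory can also be checked from PyTorch itself (assuming a PyTorch version that provides torch.cuda.mem_get_info); if the driver reports the device as nearly full while this process has reserved almost nothing, something outside this notebook is holding the memory:

import torch

# What this process' PyTorch caching allocator is holding
print(torch.cuda.memory_allocated() / 1024**2, "MiB allocated by tensors")
print(torch.cuda.memory_reserved() / 1024**2, "MiB reserved by the caching allocator")

# What the driver reports for the whole device (includes other processes)
free, total = torch.cuda.mem_get_info()
print(f"{(total - free) / 1024**2:.0f} MiB used of {total / 1024**2:.0f} MiB total")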

I have no idea why this error pops up, since I haven’t sent or trained any model on the GPU yet.

  • Is someone else using the same machine to run GPU-intensive tasks?
  • Are you running another GPU-intensive task (e.g. a game or 3D rendering)?

Try the following steps to figure out where the problem is:

  • Run the model with the most basic training loop possible. If the problem goes away, check your original training loop for anything that accumulates tensors (see the sketch below the list), and make sure you’re not calling model.to(device) on every epoch.
  • If the error persists, try using a different model to find out whether the problem is in your model (this has happened to me before, and it was due to some linear layers).
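
For example, one common form of accumulation (a hypothetical snippet, reusing the same dataloader/model/loss_fn/optimizer names as the loop below) is storing the loss tensor itself, which keeps the autograd graph of every batch alive; storing loss.item() instead releases it:

losses = []
for batch, (X, y) in enumerate(dataloader):
    pred = model(X)
    loss = loss_fn(pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # losses.append(loss)       # keeps the computation graph of every batch in GPU memory
    losses.append(loss.item())  # stores only a Python float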

Simple training loop (no autocast or gradient scalers):


def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    for batch, (X, y) in enumerate(dataloader):
        # If the model lives on the GPU, the batch has to be moved there too,
        # e.g. X, y = X.to(device), y.to(device)

        # Forward pass and loss
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

epochs = 10
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model, loss_fn, optimizer)
print("Done!")


No, it’s an online platform, and I’m not running anything else.

Thanks for the reply. The problem is with sending the model to the GPU; I haven’t reached the training step yet.

Can you share the model?

Thanks, it was a problem within the platform and they’re working to fix it.
