Cuda Out of memory when model training started

When I tried to run my training loop:

import torch
from torch.autograd import Variable

iters = 0
lossesTrain = []
lossesTest = []


print('Training Started!')

torch.backends.cudnn.benchmark = True  # let cuDNN benchmark and pick the fastest algorithms
for epochs in range(num_epochs):
    for i, (c, n) in enumerate(trainLoader):

        # wrapping the clean and noisy images in Variables (and moving them to the GPU
        # if one is available) so gradients can be computed
        if torch.cuda.is_available():
            clean = Variable(c.cuda())
            noisy = Variable(n.cuda())
        else:
            clean = Variable(c)
            noisy = Variable(n)
            
            
        #clearing gradient buffers
        optimizer.zero_grad()

        #finding outputs
        output = model(noisy)

        #calculating losses
        loss = criterion(output, clean)

        #backpropagating the loss
        loss.backward()

        #updating the parameters
        optimizer.step()

        #updating counter
        iters += 1
        
        print(i)
        if iters % 25 == 0:

            for c, n in testLoader:
                if torch.cuda.is_available():
                    cleanT = Variable(c.cuda())
                    noisyT = Variable(n.cuda())
                else:
                    cleanT = Variable(c)
                    noisyT = Variable(n)
                
                outputT = model(noisyT)
                
                lossesTest.append(criterion(outputT, cleanT).detach())
                lossesTrain.append(criterion(output, clean).detach())
                
                print('Epoch: {}, Iter: {}, lossesTrain: {}, lossesTest: {}'.format(
                    epochs, iters, lossesTrain[-1].item(), lossesTest[-1].item()))
                
#             print(plt.imshow(copy.reshape((512,512,-1)).detach().numpy()))
print('Training Done!! Save the model')

I got this RuntimeError:
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 4.00 GiB total capacity; 1.13 GiB already allocated; 212.59 MiB free; 2.75 GiB reserved in total by PyTorch)

Any ideas on why this is happening and what could be done to avoid it?

Try reducing the batch size and see if the model is able to train. The error indicates that you do not have sufficient GPU memory, but it would be better to confirm that by running your code with a smaller batch size.
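A minimal sketch of what that would look like, assuming your loaders are built with torch.utils.data.DataLoader and that trainDataset / testDataset are placeholder names for your dataset objects:

from torch.utils.data import DataLoader

batchSize = 4  # try progressively smaller values until the OOM disappears

# smaller batches mean smaller activation tensors on the GPU
trainLoader = DataLoader(trainDataset, batch_size=batchSize, shuffle=True)
testLoader = DataLoader(testDataset, batch_size=batchSize, shuffle=False)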

I halved my batchSize, but I’m still getting the same error message.

Maybe the model itself is too big. I would run it with a batch size of 1 and see how much memory the model itself needs. That gives us an idea of how much data we can put on the GPU without an OOM error.
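A minimal sketch of how to measure that, assuming the same model, criterion, and trainLoader names as in your post, with the loader rebuilt to use batch_size=1:

import torch

torch.cuda.reset_peak_memory_stats()

c, n = next(iter(trainLoader))   # a single batch from a batch_size=1 loader
clean, noisy = c.cuda(), n.cuda()

output = model(noisy)            # forward pass allocates the activations
loss = criterion(output, clean)
loss.backward()                  # backward pass allocates the gradients

print('Peak GPU memory: {:.1f} MiB'.format(
    torch.cuda.max_memory_allocated() / 1024 ** 2))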

My model has around 195k trainable parameters, so I don't think model size would be an issue.
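A quick sketch of how to confirm that count and the rough weight memory, assuming model is the network from your loop:

n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print('{} trainable parameters, ~{:.2f} MiB of float32 weights'.format(
    n_params, n_params * 4 / 1024 ** 2))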

Does it work with a smaller batch size like 2, 4, 8 etc?

I’ll have to try with those values. I’ll report back here with the results.


Even with a batchSize of 1… same error.

Let us measure how much memory the model itself needs, because if the batch size is 1 and you still run into the issue, that is a bad sign. Also, you would most likely be able to run training for a few iterations and then hit OOM, because you are also putting the validation set onto the GPU along with the training data. But the first thing would be to test the model size.
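On the second point, a minimal sketch of what evaluating the test set without keeping autograd graphs on the GPU could look like, reusing the variable names from your loop (torch.no_grad() and model.eval() are standard PyTorch calls, but whether this alone resolves your OOM is only an assumption):

        if iters % 25 == 0:
            model.eval()
            with torch.no_grad():   # no autograd graph is stored for these forward passes
                for c, n in testLoader:
                    cleanT = c.cuda() if torch.cuda.is_available() else c
                    noisyT = n.cuda() if torch.cuda.is_available() else n
                    outputT = model(noisyT)
                    lossesTest.append(criterion(outputT, cleanT).item())
            model.train()
            lossesTrain.append(loss.item())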