Cuda Out of memory when model training started

When I tried to run my training loop:

import torch
from torch.autograd import Variable

iters = 0
lossesTrain = []
lossesTest = []


print('Training Started!')

torch.backends.cudnn.benchmark = True  # let cuDNN benchmark and pick the fastest algorithms
for epochs in range(num_epochs):
    for i, (c, n) in enumerate(trainLoader):

        # wrapping the clean and noisy images in Variables (and moving them to the GPU
        # if one is available) so gradients can be computed
        if torch.cuda.is_available():
            clean = Variable(c.cuda())
            noisy = Variable(n.cuda())
        else:
            clean = Variable(c)
            noisy = Variable(n)
            
            
        #clearing gradient buffers
        optimizer.zero_grad()

        #finding outputs
        output = model(noisy)

        #calculating losses
        loss = criterion(output, clean)

        #backpropagating the loss
        loss.backward()

        #updating the parameters
        optimizer.step()

        #updating counter
        iters += 1
        
        print(i)
        if iters % 25 == 0:

            for c, n in testLoader:
                if torch.cuda.is_available():
                    cleanT = Variable(c.cuda())
                    noisyT = Variable(n.cuda())
                else:
                    cleanT = Variable(c)
                    noisyT = Variable(n)
                
                outputT = model(noisyT)
                
                lossesTest.append(criterion(outputT, cleanT).detach())
                lossesTrain.append(criterion(output, clean).detach())
                
                print('Epoch: {}, Iter: {}, lossesTrain: {}, lossesTest: {}'.format(
                    epochs, iters, lossesTrain[-1].item(), lossesTest[-1].item()))
                
#             print(plt.imshow(copy.reshape((512,512,-1)).detach().numpy()))
print('Training Done!! Save the model')

I got this RuntimeError:
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 4.00 GiB total capacity; 1.13 GiB already allocated; 212.59 MiB free; 2.75 GiB reserved in total by PyTorch)

Any ideas on why this is happening and what could be done to avoid it?

Try reducing the batch size and see if the model is able to train. The error indicates that you do not have sufficient GPU memory, but it would be better to confirm that by running your code with a smaller batch size.
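A minimal sketch of what that would look like, assuming your loaders are built with torch.utils.data.DataLoader and that trainDataset / testDataset are placeholder names for your dataset objects:

from torch.utils.data import DataLoader

batchSize = 4  # try progressively smaller values until the OOM disappears

# smaller batches mean smaller activation tensors on the GPU
trainLoader = DataLoader(trainDataset, batch_size=batchSize, shuffle=True)
testLoader = DataLoader(testDataset, batch_size=batchSize, shuffle=False)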

I halved my batchSize, but I’m still getting the same error message.

Maybe the model itself is too big. I would run it with a batch size of 1 and see how much memory the model itself needs. That gives us an idea of how much data we can put on the GPU without an OOM error.
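A minimal sketch of how to measure that, assuming the same model, criterion, and trainLoader names as in your post, with the loader rebuilt to use batch_size=1:

import torch

torch.cuda.reset_peak_memory_stats()

c, n = next(iter(trainLoader))   # a single batch from a batch_size=1 loader
clean, noisy = c.cuda(), n.cuda()

output = model(noisy)            # forward pass allocates the activations
loss = criterion(output, clean)
loss.backward()                  # backward pass allocates the gradients

print('Peak GPU memory: {:.1f} MiB'.format(
    torch.cuda.max_memory_allocated() / 1024 ** 2))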

My model has around 195k trainable parameters, so I don't think model size would be an issue.
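A quick sketch of how to confirm that count and the rough weight memory, assuming model is the network from your loop:

n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print('{} trainable parameters, ~{:.2f} MiB of float32 weights'.format(
    n_params, n_params * 4 / 1024 ** 2))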

Does it work with a smaller batch size like 2, 4, 8 etc?

I’ll have to try with those values. I’ll report back here with the results.


Even with a batchSize of 1… same error.

Let us measure how much memory the model itself needs, because if the batch size is 1 and you still run into the issue, that is a bad sign. Also, you would most likely be able to run training for a few iterations and then hit OOM, because you are also putting the validation set onto the GPU along with the training data. But the first thing would be to test the model size.
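On the second point, a minimal sketch of what evaluating the test set without keeping autograd graphs on the GPU could look like, reusing the variable names from your loop (torch.no_grad() and model.eval() are standard PyTorch calls, but whether this alone resolves your OOM is only an assumption):

        if iters % 25 == 0:
            model.eval()
            with torch.no_grad():   # no autograd graph is stored for these forward passes
                for c, n in testLoader:
                    cleanT = c.cuda() if torch.cuda.is_available() else c
                    noisyT = n.cuda() if torch.cuda.is_available() else n
                    outputT = model(noisyT)
                    lossesTest.append(criterion(outputT, cleanT).item())
            model.train()
            lossesTrain.append(loss.item())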