RuntimeError: CUDA out of memory. Tried to allocate X GB - memory ballooning during loss.backward()


I’m trying to fit a deeper model than my current one onto the GPU and run it. My current autoencoder model takes 4 GB of GPU memory, as reported by nvidia-smi and GPUtil. The newer model I’m trying to fit takes no more than 6 GB according to nvidia-smi.

Everything seems to run smoothly until my code reaches the loss.backward() statement. With the 4 GB model and 3 GB of input, reconstructed, and corrupted image tensors, GPU memory utilization stands at around 7 GB just before that. Then, when the code reaches loss.backward(), GPU memory usage balloons to 23.95 GB, and it stays there until the end of all epochs.

I have a total of 24 GB of GPU memory available. Hence, if I try to increase the number of filters in my CNN autoencoder model or increase the number of layers, I get the following error:

RuntimeError: CUDA out of memory. Tried to allocate 3.74 GiB (GPU 2; 23.65 GiB total capacity; 22.79 GiB already allocated; 31.50 MiB free; 22.83 GiB reserved in total by PyTorch)

I can’t seem to figure out why loss.backward() would take such a large amount of memory. I’m using the Adam optimizer. One more interesting thing is that my current batch size is 100. If I increase it to 500, there doesn’t seem to be any large memory problem; only 1-2 GB more memory is used. Hence the main cause of the memory ballooning is loss.backward(). I’m making sure that I clear past gradients with model.zero_grad(), I delete the tensors from the GPU once they’re no longer needed, and I also call torch.cuda.empty_cache(). Nothing seems to make an impact.

One last thing I’d like to add is that I parallelized my autoencoder model across three similar GPUs, splitting it into three parts. When I then call loss.backward(), each GPU seems to use 17 GB of memory. However, as soon as I increase the number of filters in the model by even a modest amount, I get the memory errors again. I’m using ImageNet images as data (224 × 224 × 3).
Here’s my autoencoder model:
(conv1): Conv2d(3, 256, kernel_size=(6, 6), stride=(4, 4), padding=(1, 1))
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 512, kernel_size=(6, 6), stride=(4, 4), padding=(1, 1))
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(512, 10000, kernel_size=(14, 14), stride=(1, 1))
(bn3): BatchNorm2d(10000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(deconv1): ConvTranspose2d(10000, 512, kernel_size=(14, 14), stride=(1, 1), bias=False)
(de_bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(deconv2): ConvTranspose2d(512, 256, kernel_size=(8, 8), stride=(4, 4), padding=(2, 2), bias=False)
(de_bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(deconv3): ConvTranspose2d(256, 3, kernel_size=(8, 8), stride=(4, 4), padding=(2, 2), bias=False)
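
The weight counts of the layers listed above can be worked out by hand. The sketch below is plain Python arithmetic, not anything from the original post: it assumes float32 storage (4 bytes per value) and the 224 × 224 input mentioned earlier, and the helper names are made up for illustration.

```python
# Back-of-the-envelope weight count for the printed layers.
# Assumption: float32 storage, i.e. 4 bytes per value.
BYTES_PER_FLOAT32 = 4
GIB = 1024 ** 3

def conv_weight_count(in_ch, out_ch, kernel):
    """Number of weights in a Conv2d / ConvTranspose2d with a square kernel."""
    return in_ch * out_ch * kernel * kernel

def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a Conv2d without dilation."""
    return (size + 2 * padding - kernel) // stride + 1

# (in_channels, out_channels, kernel_size) copied from the model printout
layers = {
    "conv1": (3, 256, 6),
    "conv2": (256, 512, 6),
    "conv3": (512, 10000, 14),
    "deconv1": (10000, 512, 14),
    "deconv2": (512, 256, 8),
    "deconv3": (256, 3, 8),
}

weight_counts = {name: conv_weight_count(*cfg) for name, cfg in layers.items()}
for name, count in weight_counts.items():
    gib = count * BYTES_PER_FLOAT32 / GIB
    print(f"{name:8s} {count:>13,d} weights  {gib:6.2f} GiB")

conv3_gib = weight_counts["conv3"] * BYTES_PER_FLOAT32 / GIB

# Spatial size of the encoder feature maps for a 224 x 224 input
s1 = conv_out_size(224, kernel=6, stride=4, padding=1)   # after conv1
s2 = conv_out_size(s1, kernel=6, stride=4, padding=1)    # after conv2
s3 = conv_out_size(s2, kernel=14, stride=1, padding=0)   # after conv3
print("feature map sizes:", s1, s2, s3)
```

Under these assumptions conv3 and deconv1 each hold about one billion weights, roughly 3.74 GiB apiece, which is the same size as the failed allocation in the error message. The feature maps, by contrast, shrink to 56, 14 and 1 pixels per side, which would be consistent with the batch size having little effect on memory.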

Here’s the training loop:


total_loss = []
total_loss_true = []
for epoch in range(max_epoch):
    print("Epoch: ", epoch)
    epoch_loss = []
    epoch_loss_true = []
    progress = tqdm(total=num_batches, desc='epoch % 3d' % (epoch + 1))
    for i in range(num_batches):
        print("Batch: ", i)
        X_clean = torch.zeros((batch_size, ch, h, w))
        X_corrupted = torch.zeros((batch_size, ch, h, w))
        # targets = torch.zeros(batch_size)
        for j in range(batch_size):
            tens, blah = processed_loader.dataset[i * batch_size + j]
            X_clean[j] = preprocessing(tens)
            # Corrupt image
            opened_img, blah = processed_loader.dataset[i * batch_size + j]
            opened_img_np = np.array(opened_img)
            cor_img_float64 = impulse_noise(opened_img, severity)
            # Very Important -> Conversion from float64 to uint8
            cor_img_uint8 = cor_img_float64.astype('uint8')
            corrupted_tensor = convert_to_tensor(cor_img_uint8)
            norm_corrupted_tensor = normalization(corrupted_tensor)
            X_corrupted[j] = norm_corrupted_tensor

        ################## Get Training & Target Dataset ##################
        X_clean = Variable(X_clean.type(torch.FloatTensor)).to(device)
        X_corrupted = Variable(X_corrupted.type(torch.FloatTensor)).to(device)

        ################## Train and Backpropagation ##################

        codes, rec_X = autoencoder_model(X_corrupted)

        loss = calculate_loss(true_data=X_corrupted, pred_data=rec_X)
        loss_true = calculate_loss(true_data=X_clean, pred_data=rec_X)
        print("True Loss: ", loss_true)

        del X_clean
        del X_corrupted
        del codes
        del rec_X



        progress.set_postfix({'loss': loss.item()})


The initial increase in memory usage might be only temporary, with the memory then being cached to avoid repeated allocations.
Could you check the allocated memory via torch.cuda.memory_allocated(), please?
This would give you an idea how much memory is really used to store all tensors and parameters.
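
One way to take these measurements is a small logging helper called around the suspicious step. This is a minimal sketch: the helper name and the report layout are made up here, and it assumes a PyTorch version that provides torch.cuda.memory_reserved() (1.4 or later; older releases call it memory_cached()).

```python
import torch

def cuda_memory_report(tag, device=None):
    """Return allocated/reserved CUDA memory in GiB (zeros if CUDA is absent)."""
    if not torch.cuda.is_available():
        report = {"tag": tag, "allocated_gib": 0.0, "reserved_gib": 0.0}
    else:
        gib = 1024 ** 3
        report = {
            "tag": tag,
            # Memory occupied by live tensors
            "allocated_gib": torch.cuda.memory_allocated(device) / gib,
            # Memory held by PyTorch's caching allocator, including freed blocks
            "reserved_gib": torch.cuda.memory_reserved(device) / gib,
        }
    print(f"[{tag}] allocated={report['allocated_gib']:.2f} GiB "
          f"reserved={report['reserved_gib']:.2f} GiB")
    return report

# Hypothetical placement inside a training step:
#   cuda_memory_report("before backward")
#   loss.backward()
#   cuda_memory_report("after backward")
report = cuda_memory_report("startup")
```

A large gap between the reserved and allocated figures would point to caching rather than live tensors.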

Also, are you using cudnn.benchmark? If so, could you disable it for debugging purposes and also create a run with the native CUDA implementations by disabling cudnn completely via torch.backends.cudnn.enabled = False?
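
For reference, both switches are plain module-level flags. The snippet below is a sketch of how such a debugging run could be configured; it would go at the top of the training script, before the model is built.

```python
import torch

# Turn off the cuDNN autotuner, which benchmarks several convolution algorithms
torch.backends.cudnn.benchmark = False

# For a comparison run only: bypass cuDNN and use PyTorch's native CUDA kernels
torch.backends.cudnn.enabled = False

print("cudnn.benchmark:", torch.backends.cudnn.benchmark)
print("cudnn.enabled:  ", torch.backends.cudnn.enabled)
```

The native kernels are usually slower, so the second flag is meant for diagnosis rather than for regular training.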

It would be interesting to see how much memory is used in each part of your code.

PS: Are you using an older PyTorch version, as you are using Variables, which were deprecated in 0.4?
If so, could you update to the latest stable version (1.5)?
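
On a current version the Variable wrapper can simply be dropped. A minimal sketch, using a small dummy batch in place of the real data:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Dummy batch standing in for X_corrupted (sizes chosen for illustration)
batch = torch.zeros((2, 3, 224, 224))

# Since PyTorch 0.4 tensors track gradients themselves, so no Variable is needed
batch = batch.float().to(device)
print(batch.dtype, batch.device)
```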

You may also want to experiment with Large Model Support, which allows you to overcommit GPU memory by using host memory as a swap space for inactive tensors.


My model is working now. I believe that, due to the large convolution filters (14 × 14) and channel sizes (512 and 10000) in my architecture, the parameter count was too high and GPU memory was blowing up. I still don’t understand why memory blew up exactly at the loss.backward() call, but after inspecting tensor sizes and the model parameter count, the GPU memory utilization seems reasonable.
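
A likely explanation for the timing is that gradient buffers do not exist until the first backward pass, and Adam creates its two moment buffers per parameter on the first optimizer.step(). The sketch below illustrates this on a tiny stand-in layer so that it runs on CPU; the layer sizes are hypothetical and it is not the model from this thread.

```python
import torch
import torch.nn as nn

# Tiny stand-in layer (hypothetical sizes) so the demo runs on CPU
layer = nn.Conv2d(4, 8, kernel_size=3)
optimizer = torch.optim.Adam(layer.parameters())

def tensor_bytes(t):
    return t.numel() * t.element_size()

param_bytes = sum(tensor_bytes(p) for p in layer.parameters())

# Before the first backward pass no gradient buffers exist
grads_before = [p.grad for p in layer.parameters()]

loss = layer(torch.randn(2, 4, 16, 16)).pow(2).mean()
loss.backward()  # allocates one .grad tensor per parameter, same shape
grad_bytes = sum(tensor_bytes(p.grad) for p in layer.parameters())

state_before_step = len(optimizer.state)
optimizer.step()  # Adam lazily creates exp_avg and exp_avg_sq here
adam_bytes = sum(
    tensor_bytes(state["exp_avg"]) + tensor_bytes(state["exp_avg_sq"])
    for state in optimizer.state.values()
)

print("parameters:", param_bytes, "bytes")
print("gradients: ", grad_bytes, "bytes")
print("Adam state:", adam_bytes, "bytes")
print("total / parameters:", (param_bytes + grad_bytes + adam_bytes) / param_bytes)
```

With gradients and both Adam buffers in place, the persistent training state comes to about four times the parameter memory, before counting activations or temporary workspaces.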

Lastly, I’m closing this discussion since the solution is simply to reduce the filter size to keep the parameter count at conv3 low. Thanks for all your help.