Doing the backpropagation on CPU

I’m training a 3D U-Net-like architecture (patch size 128^3) on a Tesla V100 16GB, and it runs out of memory in the loss.backward() step. The forward pass goes through, but the next line, loss.backward(), throws the following CUDA OOM error:

Traceback (most recent call last):
  File "", line 200, in <module>
  File "/cbica/external/python/anaconda/3/envs/pytorch/1.0/lib/python3.6/site-packages/torch/", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/cbica/external/python/anaconda/3/envs/pytorch/1.0/lib/python3.6/site-packages/torch/autograd/", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 15.78 GiB total capacity; 14.70 GiB already allocated; 74.62 MiB free; 54.63 MiB cached)

The failing call in my script is loss.cpu().backward(): I tried moving the loss to the CPU for backprop, but I still get the CUDA OOM error.

The crux of my training script is given below:

for ep in range(num_epochs):
    start = time.time()
    for batch_idx, (subject) in enumerate(train_loader):
        # Load the subject and its ground truth

        image = subject['image']
        mask = subject['gt']
        # Loading images into the GPU and ignoring the affine
        image, mask = image.float().cuda(), mask.float().cuda()
        # Variable class is deprecated - parameters to be given are the tensor, whether it requires grad and the function that created it

        image, mask = Variable(image, requires_grad = True), Variable(mask, requires_grad = True)

        # Making sure that the optimizer has been reset
        optimizer.zero_grad()
        # Forward Propagation to get the output from the models
        output = model(image.float())
        # Computing the loss
        loss = loss_fn(output.cpu().double(), mask.cpu().double(), n_classes)
        # Back Propagation for model to learn
        loss = loss.cpu()
        loss.backward()
        #Updating the weight values
        optimizer.step()
        #Pushing the dice to the cpu and only taking its value
        curr_loss = MCD_loss(output.double(), mask.double(), n_classes).cpu().data.item()
        #Computing the total loss
        total_loss += curr_loss
        # Computing the average loss
        average_loss = total_loss/(batch_idx + 1)
        #Computing the dice score 
        curr_dice = 1 - curr_loss
        #Computing the total dice
        total_dice+= curr_dice
        #Computing the average dice
        average_dice = total_dice/(batch_idx + 1)

Any information would be of great help. Thanks in advance.


The backward pass always runs on the same device where the forward pass ran, so moving the loss to the CPU does not force the backward to be computed on the CPU.
There is no way to do this at the moment.

The usual way to deal with this is to reduce the batch size. To compensate, you can do 2 forward/backward passes before each optimizer.step(), which doubles the effective batch size.
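
A minimal sketch of that accumulation pattern, for illustration only. The tiny linear model, MSE loss, random batches, and the names accum_steps and num_updates are hypothetical stand-ins, not the 3D U-Net, Dice loss, or loader from the question. It runs on CPU so it can be tried anywhere.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins for the real model, loss, optimizer, and data loader.
model = nn.Linear(8, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [(torch.randn(2, 8), torch.randn(2, 1)) for _ in range(4)]

accum_steps = 2  # forward/backward passes per optimizer step
num_updates = 0

optimizer.zero_grad()
for step, (x, y) in enumerate(batches, start=1):
    loss = loss_fn(model(x), y)
    # Gradients add up across backward() calls; dividing by accum_steps
    # makes the sum match the mean loss over the combined batch.
    (loss / accum_steps).backward()
    if step % accum_steps == 0:
        optimizer.step()       # one weight update per accum_steps batches
        optimizer.zero_grad()  # clear gradients before the next group
        num_updates += 1
```

Only one small batch's activations are alive at a time, so peak memory follows the small batch size while the update behaves roughly like one computed on the larger batch (layers such as batch norm still see the small batch).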

Thank you for the information!

Also, torch.utils.checkpoint might be useful to trade compute for memory.
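
A rough sketch of how that could look, assuming a recent PyTorch release where checkpoint accepts use_reentrant (older installs, such as the 1.0 environment in the traceback, do not have that argument). The block and head modules below are hypothetical stand-ins for one memory-heavy stage of the network and whatever follows it.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)

# Hypothetical stand-in for one memory-heavy stage of the network.
block = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
head = nn.Linear(16, 1)

x = torch.randn(4, 16)

# Intermediate activations inside `block` are not kept after the forward
# pass; they are recomputed when backward() reaches this segment.
hidden = checkpoint(block, x, use_reentrant=False)
loss = head(hidden).sum()
loss.backward()
```

The cost is a second forward pass through each checkpointed segment during backward, so this trades extra compute for a lower activation memory peak.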
