CUDA out of memory in the validation section

I encountered a CUDA out of memory error while running the attached code.
The error is raised at this line:

    Vrestored = model_restoration(Vinput_)

The error occurs specifically during the validation phase, which runs inside the training loop. To address the issue, I tried deleting some variables from the training part and clearing the memory cache, but this did not resolve the problem. I also tried reducing the batch size, calling the garbage collector, and wrapping the forward pass in torch.no_grad(), but none of these attempts were successful.

    for epoch in range(start_epoch, opt.OPTIM.NUM_EPOCHS + 1):
        epoch_start_time = time.time()
        epoch_loss = 0
        train_id = 1

        model_restoration.train()
        for i, data in enumerate(tqdm(train_loader), 0):

            # zero_grad
            for param in model_restoration.parameters():
                param.grad = None

            with torch.no_grad():

                target = data[0].to('cuda')
                input_ = data[1].to('cuda')

            if epoch > 5:
                target, input_ = mixup.aug(target, input_)
            with torch.no_grad():
                restored = model_restoration(input_)

            # Compute loss at each stage
            loss = np.sum([criterion(torch.clamp(restored[j], 0, 1), target) for j in range(len(restored))])
            loss.requires_grad = True
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()

            del target, input_, restored
            torch.cuda.empty_cache()

            #### Evaluation ####
            if i % eval_now == 0 and i > 0 and (epoch in [1, 25, 45] or epoch > 60):
                model_restoration.eval()
                psnr_val_rgb = []
                for ii, data_val in enumerate((val_loader), 0):
                    Vtarget = data_val[0].to('cuda')
                    Vinput_ = data_val[1].to('cuda')

                    with torch.no_grad():
                        Vrestored = model_restoration(Vinput_)
                    Vrestored = Vrestored[0]

                    for res, tar in zip(Vrestored, Vtarget):
                        psnr_val_rgb.append(utils.torchPSNR(res, tar))

                del Vtarget, Vinput_, Vrestored
                torch.cuda.empty_cache()

                psnr_val_rgb = torch.stack(psnr_val_rgb).mean().item()

Unrelated to the OOM error, but your training code already looks broken.

These lines of code:

    with torch.no_grad():
        restored = model_restoration(input_)

    # Compute loss at each stage
    loss = np.sum([criterion(torch.clamp(restored[j], 0, 1), target) for j in range(len(restored))])
    loss.requires_grad = True
    loss.backward()
    optimizer.step()

will not update the model, since the forward pass was performed in a no_grad() context and therefore no computation graph was created.
Setting the .requires_grad attribute of the loss tensor afterwards does not fix this.
Also, you are using numpy operations, which Autograd cannot track, so stick to PyTorch ops.
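
As a minimal sketch (assuming criterion returns scalar PyTorch tensors and model_restoration returns a list of per-stage outputs, as in your snippet), the training step could look like this:

```
# Run the forward pass without no_grad so Autograd can build the graph
restored = model_restoration(input_)

# Sum the per-stage losses with PyTorch ops (not np.sum), so the result
# stays attached to the computation graph
loss = sum(criterion(torch.clamp(r, 0, 1), target) for r in restored)

loss.backward()
optimizer.step()
```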

For the OOM issue, I would check how large the validation batch size is and reduce it if needed.
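
For example (val_dataset and the DataLoader arguments below are placeholders, not taken from your script; adjust them to your setup):

```
from torch.utils.data import DataLoader

# Validation images are often larger than the training crops, so a small
# batch size (or even 1) keeps the peak memory low during evaluation
val_loader = DataLoader(val_dataset, batch_size=1, shuffle=False,
                        num_workers=4, pin_memory=True)
```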

I reduced the validation batch size to 8 and made the modifications shown in the attached code, but I now think the np.sum operation is making the training take longer. How can I solve this problem?

 ```
    for i, data in enumerate(tqdm(train_loader), 0):

        # zero_grad
        for param in model_restoration.parameters():
            param.grad = None

        target = data[0].to(device)
        input_ = data[1].to(device)

        if epoch > 5:
            target, input_ = mixup.aug(target, input_)

        with torch.no_grad():
            restored = model_restoration(input_)

        loss = np.sum(
            [criterion(torch.clamp(restored[j], 0, 1), target).cpu().numpy() for j in range(len(restored))])
        loss = torch.tensor(loss, requires_grad=True)
        loss.backward()  # works
        optimizer.step()
        scheduler.step()
        epoch_loss += loss.item()
 ```

You are synchronizing your code by moving the data to the CPU, and you are again detaching the computation graph by converting to numpy operations, which Autograd also won't understand.
Recreating the tensor afterwards via:

loss = torch.tensor(loss, requires_grad=True)

won’t re-attach it to the computation graph as already mentioned.

To avoid the slowdown, keep the tensors on the GPU and use PyTorch operations instead.
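
As a sketch (again assuming criterion returns scalar tensors, as in your snippet), the loss computation could look like this:

```
# The forward pass must also run without the no_grad() guard, as mentioned above.
# Everything stays on the GPU and inside Autograd: no .cpu(), no numpy,
# and no re-created loss tensor.
loss = torch.stack(
    [criterion(torch.clamp(restored[j], 0, 1), target) for j in range(len(restored))]
).sum()
loss.backward()
optimizer.step()
scheduler.step()
epoch_loss += loss.item()
```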