How to perform one backprop after two feedforward passes?

Hi everyone,

Currently, I can’t feed batches of more than 64 samples for training (memory constraints on a single GPU). But my custom loss calculation requires at least 128 predictions, since I’ll be using PCA to reduce the feature dimensions to 128 and PCA won’t work when samples < components.

The routine way of getting model predictions is:

for batch_idx, data in enumerate(dataloader['train']):
    batch, lbl = data[0], data[1]
    out = model(batch)
    loss = custom_loss(out, lbl)
    loss.backward()

I’m trying to collect the model outputs for two training batches (64 × 2 = 128) before continuing with the loss calculation, i.e.,

out_, lbl_ = [], []
for batch_idx, data in enumerate(dataloader['train']):
    batch, lbl = data[0], data[1]
    out_.append(model(batch))
    lbl_.append(lbl)

    if (batch_idx + 1) % 2 == 0:
        # concatenate the two 64-sample batches into 128 predictions
        loss = custom_loss(torch.cat(out_), torch.cat(lbl_))
        loss.backward()
        out_, lbl_ = [], []  # start collecting the next pair of batches

yet there’s still this error: “RuntimeError: CUDA out of memory.” ¯\_(°_o)_/¯

I’d be grateful if you could help me with these questions:

  1. Is there a separate graph created (requiring more memory) each time a model is fed with a training batch?
  2. How may I get multiple outputs before calculating and backpropagating the loss?

Basically, if the inputs require gradients, an autograd graph is created during the forward pass and is only released once you call .backward()/torch.autograd.grad() on the output (unless you pass retain_graph=True, in which case it is kept) or the references to the output go out of scope.

This means that splitting the 128 into two batches won’t help you if you want to compute backwards on all of it.
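To illustrate your first question, here is a rough, untested sketch reusing the model and batch names from your snippet (the .sum() is just a stand-in loss): each forward pass builds its own graph, and keeping references to the outputs keeps those graphs in memory until backward is called.

outs = []
for _ in range(2):
    outs.append(model(batch))             # each forward pass builds its own graph
    print(torch.cuda.memory_allocated())  # allocated memory grows while the outputs (and their graphs) are kept alive

loss = outs[0].sum() + outs[1].sum()      # stand-in for a real loss
loss.backward()                           # backward releases both graphs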

Now, what you could do is compute 64 outputs at a time, save a detached version of them for the next batch, and then combine the undetached outputs with the saved, detached 64 outputs of the last batch for your PCA.

This way, you do your PCA with 128 but only need the autograd graph for 64 samples.
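Something along these lines (untested sketch, reusing the names from your snippet; prev_out/prev_lbl are just placeholder names, and custom_loss is assumed to accept the concatenated 128 outputs and labels):

prev_out, prev_lbl = None, None

for batch_idx, data in enumerate(dataloader['train']):
    batch, lbl = data[0], data[1]
    out = model(batch)                    # autograd graph only for these 64 samples

    if prev_out is not None:
        # 64 "live" outputs + 64 detached outputs from the previous batch = 128 for the PCA
        loss = custom_loss(torch.cat([out, prev_out]),
                           torch.cat([lbl, prev_lbl]))
        loss.backward()                   # gradients only flow into the current `out`

    # keep a detached copy for the next iteration; this does not hold on to a graph
    prev_out, prev_lbl = out.detach(), lbl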

Best regards

Thomas

P.S.: There are mathematical intricacies to this, but you could try it and it might work. Another option could be to compute 64 samples with no_grad and 64 in the normal way, to better preserve the independence between batches.
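A rough sketch of that variant (again untested; it assumes the dataloader yields (batch, label) pairs and that custom_loss accepts the concatenated 128 samples):

train_iter = iter(dataloader['train'])
for batch_a, lbl_a in train_iter:
    try:
        batch_b, lbl_b = next(train_iter)  # second, fresh 64-sample batch
    except StopIteration:
        break

    out_a = model(batch_a)                 # normal forward, graph for 64 samples
    with torch.no_grad():
        out_b = model(batch_b)             # no graph, just extra samples for the PCA

    loss = custom_loss(torch.cat([out_a, out_b]),
                       torch.cat([lbl_a, lbl_b]))
    loss.backward()                        # only out_a receives gradients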
