Hi, I’ve got to re-run a variational inference deep learning model (using the reparameterization trick) many times before each backward step in order to accumulate prediction metrics…
But for simplicity I’ll ask about this toy problem: is it possible to change the code below so that mean_output retains valid gradients, while the autograd memory footprint stays constant as the number of samples grows?
batch = next(iter(train_loader))
batch_inputs = batch['x'].cuda()
mean_output = 0
n_samples = 10000  # I want autograd memory to remain constant with more samples
for i in range(n_samples):
    mean_output += stochastic_model(batch_inputs) / n_samples
P.S. The key point is that accumulating the samples must not store the intermediate tensors (i.e., the per-sample computation graphs).
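For context, here is roughly how mean_output gets used afterwards; criterion and batch['y'] are just placeholders standing in for my actual loss function and targets:

targets = batch['y'].cuda()             # placeholder for my real targets
loss = criterion(mean_output, targets)  # placeholder loss, e.g. nn.MSELoss()
loss.backward()  # this needs a valid gradient path back through all n_samples forward passes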