requires_grad_() on samples?

I suppose this is rather a machine learning question, but it might also stem from my lack of understanding of autograd’s behavior, so I’ll ask here as well.

I am trying to reverse engineer code whose previous maintainer is unreachable. requires_grad_() is called on the batches returned from the DataLoader, which should mean that, when backpropagating, the samples will accumulate gradients?

I don’t know of any training pipeline that optimizes the dataset samples themselves. Could it be to aid in computing some sort of measurement on the dataset based on the gradients?

Also, are samples returned from the dataset and then aggregated by the DataLoader always deep copies? That is, when multiple epochs are run on the same dataset, will these gradients continue to accrue on the same sample? Or, since they are deep copies, will the samples start off with no gradients each time a batch is returned and requires_grad_() is called?

Thanks for your answer.


  1. There are some cases in which differentiating the loss with respect to the input makes sense. One is adversarial training, where we slightly modify the input in order to fool a classifier. This is rare, and I assume you would have known if that were the case in your repo. Another use case is interpretability - asking the question “which pixel, if it was changed, would influence the target prediction?” - which sounds more common but is also not standard. Finally, there might be technical reasons to need the input to have a gradient, e.g. activation checkpointing, which requires at least one of the inputs to have a gradient. It will be easiest if you can share a code example or a link to the codebase to help figure out whether it’s one of these. It should also be noted that if it can be avoided, it’s best to avoid it - computing and storing the extra gradients has overhead.
  2. There are multiple factors in play: whether the dataset loads items from disk, which is typically the case (in which case a new tensor is allocated for every new batch), and whether the DataLoader aggregates (concatenates) multiple such items into a batch, creating a new tensor. In the typical use case, the tensor you see in the batch is later discarded, so gradients saved on an item are not accumulated over epochs. It is technically possible - one could deliberately make choices that cause that - but it is not common. Again, if you share specific code we can take a better look.
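To make the input-gradient use cases in item 1 concrete, here is a minimal sketch (a toy model with made-up shapes, not from your repo) showing that calling requires_grad_() on a batch makes the gradient of the loss with respect to the input available after backward():

```python
import torch
import torch.nn as nn

# Toy stand-in for the real classifier; names and shapes are hypothetical.
model = nn.Linear(10, 2)

x = torch.randn(4, 10).requires_grad_()  # same call as on the DataLoader batches
loss = model(x).sum()
loss.backward()

# x.grad now holds d(loss)/dx: the raw material for saliency maps or
# FGSM-style adversarial perturbations such as x + eps * x.grad.sign().
print(x.grad.shape)  # torch.Size([4, 10])
```

Note that the optimizer never sees x, so even with this gradient computed, the sample itself is not updated unless code explicitly reads x.grad.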

Thanks for your insightful answer.

I am looking into beta-tcvae using the dsprites dataset.

The relevant line is at the root in ./ and still uses the archaic Variable instead of requires_grad_().

The dataset loads once from disk into memory, then returns the tensors from memory which is why the question arose.

Could you also briefly elaborate on why activation checkpointing requires at least one of the inputs to have a gradient? I primarily read

Activation checkpointing is a technique that trades compute for memory. Instead of keeping tensors needed for backward alive until they are used in gradient computation during backward, forward computation in checkpointed regions omits saving tensors for backward and recomputes them during the backward pass. Activation checkpointing can be applied to any part of a model.

and Ctrl+F’d the site for “gradient”, but I couldn’t figure out the reason.

Thanks for your answer

I’m not familiar with TCVAE and don’t remember beta-VAE well enough to be sure there’s no sorcery in there, so take this with a grain of salt. If I’m correct that the VAE’s parameters do not include the data itself, then since the optimizer only receives vae.parameters(), even if a gradient on x is calculated, x is not modified. I’ve also not seen a usage of .grad anywhere there, but it could be that some of the imported code uses it and I missed it.
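A tiny sketch of that point, with a toy model standing in for the VAE (all names here are hypothetical): even when the input carries a gradient, optimizer.step() only touches the parameters it was constructed with.

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)  # x is never registered here

x = torch.ones(2, 3, requires_grad=True)
x_before = x.detach().clone()

opt.zero_grad()
model(x).sum().backward()
opt.step()  # updates model.weight and model.bias only

assert x.grad is not None        # a gradient on the input was computed...
assert torch.equal(x, x_before)  # ...but x itself was not modified
```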

For your question - it appears activation checkpointing is not in use here (which might be confusing, because the word “checkpoint” appears, but in the “saved model” sense; the naming clash is unfortunate), so this is only worth discussing if you are curious. Activation checkpointing (at least the current, reentrant implementation, which is being phased out nowadays) needs a tensor with gradients because it is implemented as a custom autograd function, which plays the role of a node in the computational graph; if none of its inputs requires gradients, the autograd engine sees no reason to start building a computational graph (which we do want, so we can backpropagate through it).
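If you are curious, the behavior is easy to observe with torch.utils.checkpoint (a sketch against the reentrant implementation; the exact warning text may vary by PyTorch version):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

layer = nn.Linear(8, 8)

# Input WITHOUT gradients: the custom autograd function sees no input that
# requires grad, so autograd never builds a graph through the region
# (and PyTorch warns that gradients will be None).
x_nograd = torch.randn(2, 8)
y1 = checkpoint(layer, x_nograd, use_reentrant=True)
print(y1.requires_grad)  # False

# Input WITH gradients: the function registers itself as a graph node,
# and backward() recomputes the forward before backpropagating through it.
x_grad = torch.randn(2, 8, requires_grad=True)
y2 = checkpoint(layer, x_grad, use_reentrant=True)
print(y2.requires_grad)  # True
y2.sum().backward()
print(x_grad.grad is not None)  # True
```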

Will be happy to elaborate, but for your actual problem - what I would do is run the code in debug mode and see if x.grad is populated at any point. If so, I would search for where it is used. If it is either never populated or never accessed, the Variable wrapper can be removed.


Hm, if it is technology that is likely being phased out right now, I’ll hold off on dealing with it for now.

By debug mode, do you mean there are callbacks that you can attach to the population of torch.Tensor.grad or do you mean just seeing if x.grad has any value each iteration before it is reset?

The second, simpler option. I’d step through the lines of the iteration code (once is enough) and monitor x.grad.
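That monitoring might look like the following sketch (a toy model stands in for the actual VAE; the structure of the check, not the names, is the point):

```python
import torch
import torch.nn as nn

model = nn.Linear(5, 1)  # hypothetical stand-in for the real model

for step in range(2):
    x = torch.randn(3, 5).requires_grad_()  # as done on the DataLoader batches
    loss = model(x).sum()
    loss.backward()

    # Set a breakpoint (or print) here on each iteration: if x.grad is
    # populated but never read anywhere else in the codebase, the Variable
    # wrapper / requires_grad_() call can likely be removed.
    print(step, x.grad is not None)
```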