How to get a list of every tensor in the autograd tree?

Hello,
I am trying to train a rather large pytorch model. Training runs just fine if I use a batch_size of 1, but when using larger batches, the back propagation step fails (see error message below). Unfortunately, I have found it difficult to track down which tensor is causing the problem. My attempts to track down this tensor have been unsuccessful, so far. I have checked every tensor I can find in the code base and I have yet to even find a tensor that has 16 elements.

Is there a more systematic approach I can take here? Surely the autograd system must store some sort of reference to the relevant tensors? If I could view a record of all the tensors in the autograd tree, then perhaps I could narrow down my search based on the values found in the offending tensor.
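For concreteness, the kind of thing I have in mind is a sketch like the one below, which walks backwards from loss.grad_fn. (I realize this only reaches leaf tensors through their AccumulateGrad nodes, and that ._version is an internal attribute, so this may well not be the right approach.)

```python
import torch

def list_autograd_leaves(loss):
    # Walk the autograd graph backwards from the loss. Leaf tensors
    # (e.g. parameters) hang off AccumulateGrad nodes via .variable.
    seen, queue, leaves = set(), [loss.grad_fn], []
    while queue:
        fn = queue.pop()
        if fn is None or fn in seen:
            continue
        seen.add(fn)
        if hasattr(fn, "variable"):
            leaves.append(fn.variable)
        queue.extend(next_fn for next_fn, _ in fn.next_functions)
    return leaves

model = torch.nn.Linear(8, 1)
loss = model(torch.randn(4, 8)).sum()
for t in list_autograd_leaves(loss):
    print(tuple(t.shape), t._version)  # ._version is the internal version counter
```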

Or perhaps there is some other approach that I should take?

Any help would be greatly appreciated.

-------ERROR--------
one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.DoubleTensor [16]] is at version 11; expected version 10 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
--------------------

Have you used the torch.autograd.set_detect_anomaly function? Automatic differentiation package - torch.autograd — PyTorch 2.3 documentation
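Anomaly mode makes this class of error much easier to localize. A minimal repro (not your model, just an illustration of the mechanism):

```python
import torch

torch.autograd.set_detect_anomaly(True)  # record forward-pass tracebacks

a = torch.randn(16, dtype=torch.double, requires_grad=True)
b = a.exp()   # exp() saves its output for the backward pass
b.add_(1)     # in-place op bumps b's version counter: 0 -> 1
b.sum().backward()
# RuntimeError: one of the variables needed for gradient computation has
# been modified by an inplace operation ... With anomaly mode on, a second
# traceback points at the forward line (b = a.exp()) that saved the tensor.
```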

Can you share the model?

You could try plotting out your network with torchviz? GitHub - szagoruyko/pytorchviz: A small package to create visualizations of PyTorch execution graphs
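For example (a minimal sketch; it assumes the torchviz package and the graphviz binaries are installed):

```python
import torch
from torchviz import make_dot

model = torch.nn.Linear(16, 16)
out = model(torch.randn(4, 16)).sum()

# make_dot builds a graphviz Digraph of the autograd graph; passing the
# named parameters labels the leaf nodes with their names and shapes.
dot = make_dot(out, params=dict(model.named_parameters()))
dot.render("autograd_graph", format="png")  # writes autograd_graph.png
```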


Problem solved:

I was resetting parameters via the overall model (the DDP wrapper) instead of via the individual modules (model.module).

Consequently, the parameters were being reset by each GPU, which incremented the tensors' version counters more than expected.
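Roughly, the change looked like this (a sketch with illustrative names, not the actual code; it assumes the process group is already initialized, e.g. via torchrun):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

net = torch.nn.Linear(16, 16).cuda()
model = DDP(net)  # in the real code: DDP(net, device_ids=[local_rank])

# Before (wrong): resetting parameters through the DDP wrapper itself.
# After (fix): reach the underlying model via model.module and reset the
# individual submodules there.
for m in model.module.modules():
    if hasattr(m, "reset_parameters"):
        m.reset_parameters()
```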

Not sure how to mark this thread as solved.

Thank you for the advice!