Hello,
I am trying to train a rather large PyTorch model. Training runs fine with a batch_size of 1, but with larger batches the backpropagation step fails (see the error message below). So far, my attempts to track down the offending tensor have been unsuccessful: I have checked every tensor I can find in the code base and have yet to find one that even has 16 elements.
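For context, here is a tiny standalone example (nothing to do with my actual model) that reproduces the same class of error, just to illustrate the failure mode I am hitting:

```python
import torch

# Minimal repro of the same error class (not my actual model):
# sigmoid saves its *output* for the backward pass, so editing
# that output in place bumps its version counter and breaks backward.
a = torch.randn(16, dtype=torch.double, requires_grad=True)
b = torch.sigmoid(a)
b += 1              # in-place op: b's version goes from 0 to 1
b.sum().backward()  # RuntimeError: ... [torch.DoubleTensor [16]] is at version 1; expected version 0
```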
Is there a more systematic approach that I can take here? Surely the autograd system must store some sort of reference to the relevant tensors? If I could view a record of all the tensors in the autograd graph, then perhaps I could narrow down my search based on the values found in the offending tensor.
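To make that concrete, this is roughly the kind of thing I have in mind: walking the graph from the loss tensor's grad_fn and printing whatever each node saved for backward. This is only a sketch; `loss` stands in for my actual loss tensor, and it relies on the `_saved_*` attributes that grad_fn nodes expose:

```python
import torch

def dump_saved_tensors(fn, depth=0, seen=None):
    """Recursively print the autograd graph below `fn`, listing the
    tensors each node saved for backward, along with their shape and
    the in-place version counter that the error message refers to."""
    if seen is None:
        seen = set()
    if fn is None or fn in seen:
        return
    seen.add(fn)
    print("  " * depth + type(fn).__name__)
    for attr in dir(fn):
        if not attr.startswith("_saved_"):
            continue
        try:
            val = getattr(fn, attr)
        except RuntimeError as exc:
            # the version-counter check fires on access, which would
            # point straight at the offending tensor
            print("  " * (depth + 1) + f"{attr}: {exc}")
            continue
        if torch.is_tensor(val):
            print("  " * (depth + 1)
                  + f"{attr}: shape={tuple(val.shape)}, version={val._version}")
    for next_fn, _ in fn.next_functions:
        dump_saved_tensors(next_fn, depth + 1, seen)

# e.g., after the forward pass but before calling backward():
# dump_saved_tensors(loss.grad_fn)
```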
Or perhaps there is some other approach that I should take?
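For instance, would enabling autograd's anomaly detection help here? My understanding is that it makes backward() also report the forward-pass traceback of the operation that produced the bad tensor, something along these lines (`model` and `batch` are placeholders for my actual training step):

```python
import torch

# re-run one failing iteration with anomaly detection enabled;
# the backward error should then include a traceback pointing at
# the forward-pass op that created the modified tensor
with torch.autograd.detect_anomaly():
    loss = model(batch)
    loss.backward()
```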
Any help would be greatly appreciated.
-------ERROR--------
one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.DoubleTensor [16]] is at version 11; expected version 10 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
---------------------------