I assume you want to keep all computation graphs alive in order to retain the gradient information?
If so, then the large increase in memory would be expected. Detaching the tensor would reduce the memory usage, but you won't be able to compute gradients w.r.t. the previously used parameters anymore.
You could reduce the batch size, compute the gradients with your custom loss on the smaller number of samples, and accumulate these gradients over multiple iterations (if your loss allows it).
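A minimal sketch of this gradient accumulation approach could look like this; the model, loss, split size, and `accumulation_steps` are placeholders for your actual setup:

```python
import torch
import torch.nn as nn

# Placeholder model, loss, and data; replace with your own setup.
torch.manual_seed(0)
model = nn.Linear(10, 1)
criterion = nn.MSELoss()  # stand-in for your custom loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
w0 = model.weight.detach().clone()  # snapshot to verify an update happened

accumulation_steps = 4  # effective batch size = 4 * micro-batch size
data = torch.randn(32, 10)
target = torch.randn(32, 1)

optimizer.zero_grad()
for i, (x, y) in enumerate(zip(data.split(8), target.split(8))):
    out = model(x)
    # scale the loss so the accumulated gradients correspond to the mean
    loss = criterion(out, y) / accumulation_steps
    loss.backward()  # gradients are accumulated in the .grad attributes
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # update using the accumulated gradients
        optimizer.zero_grad()  # reset before the next accumulation cycle
```

Each `backward()` call frees its computation graph after use, so only one micro-batch graph is alive at a time, which is what reduces the peak memory compared to a single large batch.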