Hi, I’m running into a CUDA out-of-memory issue, and I’m thinking of dropping some useless variables from the computation graph to save memory.

In particular, I have a tensor sim_score which is calculated in the forward pass by some previous encoders, e.g., sim_score = tensor([[1, 2, 3, 4], [5, 6, 7, 8]]). I also have a mask tensor, e.g., mask = tensor([[0, 1, 0, 0], [1, 1, 1, 0]]). I then do something like output = torch.sum(sim_score * mask, dim=1). As you can see, the values 1, 3, 4, 8 in sim_score are useless for any future computation. However, multiplying by the mask tensor seems to keep all the values in the computation graph. Is there a way to avoid keeping those useless values in the graph, so I can save some GPU memory?

I’ve tried directly assigning zeros into sim_score, like sim_score[~mask] = 0, but I got the error: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation.
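For context, a minimal sketch that reproduces this error (assuming sim_score is an intermediate result whose backward needs its saved value — imitated here with exp(), which saves its output for the backward pass):

```python
import torch

x = torch.randn(2, 4, requires_grad=True)
sim_score = x.exp()  # exp() saves its output for backward
mask = torch.tensor([[0, 1, 0, 0], [1, 1, 1, 0]], dtype=torch.bool)

sim_score[~mask] = 0  # in-place write into a tensor saved for backward

msg = ""
try:
    sim_score.sum().backward()
except RuntimeError as err:
    msg = str(err)
print(msg)  # complains that a variable needed for gradient computation
            # has been modified by an inplace operation
```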

Two comments here:
The autograd graph is per-tensor, not per-element, so you cannot mask out part of a tensor to free its share of the computational graph.
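A small sketch of this, reusing the names from the question: gradients for the masked-out entries come back as zero, but the mask multiply still tracks whole tensors in the graph.

```python
import torch

sim_score = torch.tensor([[1., 2., 3., 4.], [5., 6., 7., 8.]],
                         requires_grad=True)
mask = torch.tensor([[0., 1., 0., 0.], [1., 1., 1., 0.]])

out = torch.sum(sim_score * mask, dim=1)
out.sum().backward()

# The gradient is simply the mask: masked-out entries get gradient 0,
# but autograd still handled the full sim_score tensor as one node.
print(sim_score.grad)
```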

For much the same reason, assigning values will not cut the graph; instead it inserts a new CopySlices node:

import torch

a = torch.randn(5, requires_grad=True)
b = torch.randn(5, requires_grad=True)
c = torch.zeros(4, 7, requires_grad=True)
d = c * 2        # d.grad_fn is MulBackward0
d[1, 1:-1] = a   # each assignment wraps d.grad_fn in a CopySlices node
d[2, 1:-1] = b

will give an autograd graph whose final nodes attached to d are the CopySlices:

If you were to draw the intermediate graphs, you would see that while the tensor d stays the same, the autograd node directly attached to it moves from MulBackward0 to CopySlices with each assignment.
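One way to observe this directly, without drawing the graph, is to inspect d.grad_fn before and after an assignment (a minimal check, using the same tensors as above):

```python
import torch

a = torch.randn(5, requires_grad=True)
c = torch.zeros(4, 7, requires_grad=True)

d = c * 2
before = type(d.grad_fn).__name__  # the multiply's backward node
d[1, 1:-1] = a
after = type(d.grad_fn).__name__   # replaced by the slice-assignment node
print(before, "->", after)
```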

@tom Hi Thomas, thanks for your reply! I have a follow-up question: what if I have

a = torch.randn((4, 7), requires_grad=True)
c = torch.zeros(4, 7)  # note: no requires_grad here — in-place writes into
                       # a leaf that requires grad would raise an error
c[1, 1:3] = a[1, 1:3]
c[2, 3:4] = a[2, 3:4]
# instead of operating on a, we use c for future computation

And imagine a is calculated by some previous layers. I guess some values in a won’t receive a gradient in the backward pass. But can this save some GPU memory, or not at all (compared to using a * mask for future computation)?
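To illustrate the gradient side of the question (a sketch only — it does not settle the memory question, since whether memory is actually saved depends on whether anything else keeps a alive):

```python
import torch

a = torch.randn(4, 7, requires_grad=True)
c = torch.zeros(4, 7)  # plain buffer; the slice copies pull it into the graph

c[1, 1:3] = a[1, 1:3]
c[2, 3:4] = a[2, 3:4]

c.sum().backward()

# Only the copied positions of a receive a gradient of 1;
# every other entry of a.grad is zero.
print(a.grad)
```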