I’m playing with torch.autograd.graph.saved_tensors_hooks to compress the tensors saved for backward. In the process I found that the two snippets below (after the sketch) appear to have different memory usage (assuming x.size() >> b.size()).
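For context, the hook pattern I mean is roughly the following; this is only a sketch, and the fp16 round-trip just stands in for a real compressor:

import torch

def pack(t):
    # called when autograd saves a tensor for the backward pass
    return t.to(torch.float16)

def unpack(t):
    # called when the backward pass needs the tensor again
    return t.to(torch.float32)

with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    x = torch.randn(8, 8, requires_grad=True)
    y = x.pow(2).sum()
y.backward()  # unpack() runs here to rebuild the saved tensor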
case a)
x = f(a)
y = torch.matmul(x.T, b)
z = torch.matmul(x, c)

case b)
x = f(a)
y = torch.matmul(b.T, x)
z = torch.matmul(x, c)
It looks like case b) keeps using the same x in the forward of both y and z (verified with id()), while case a) creates a new instance to save x.T (verified with id()). In that sense, can I say b) is more memory efficient?
Also, as a related question: judging by the Python id() of the tensors, PyTorch doesn’t seem to duplicate the same tensor between the forward and backward passes. Is comparing id() the right way to check whether two tensors point to the same memory allocation?
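To illustrate the question: x.T is a different Python object from x even though it is a view of the same storage, which is why I’m unsure id() is the right check. A small sketch of what I mean:

import torch

x = torch.randn(4, 4)
xt = x.T  # transpose returns a view

print(id(x) == id(xt))                # False: two distinct Python wrappers
print(x.data_ptr() == xt.data_ptr())  # True: both point at the same allocation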
Thanks @ptrblck, but does this hold true for the backward case? Doesn’t autograd make independent copies of b and b.T? Please correct me if I’m wrong, but if I am right, then
case a) stores: x, x.T, b, c
case b) stores: x, b.T, b, c
Since x.size() >> b.size(), wouldn’t case a) need more memory for the backward pass?
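One way to inspect what autograd actually saved for the matmul is via the _saved_* attributes on grad_fn (a sketch; these attribute names come from the autograd notes on saved tensors and may differ across PyTorch versions):

import torch

x = torch.randn(1024, 1024, requires_grad=True)
b = torch.randn(1024, 1, requires_grad=True)

y = torch.matmul(b.T, x)  # case b)
# MmBackward0 saves both matmul inputs; compare allocations, not id():
print(y.grad_fn._saved_self.data_ptr() == b.data_ptr())  # b.T is a view of b
print(y.grad_fn._saved_mat2.data_ptr() == x.data_ptr())  # x is saved as-is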
No, this should also not be the case, and you could profile the memory usage via torch.cuda.memory_summary() to compare both approaches:
import torch

print(torch.cuda.memory_summary())

x = torch.randn(1024, 1024).cuda().requires_grad_(True)
b = torch.randn(1024, 1).cuda().requires_grad_(True)
c = torch.randn(1024, 512).cuda().requires_grad_(True)

# case a) (uncomment to profile instead of case b)
# y = torch.matmul(x.T, b)
# print(torch.cuda.memory_summary())
# y.mean().backward()
# print(torch.cuda.memory_summary())
# z = torch.matmul(x, c)
# z.mean().backward()
# print(torch.cuda.memory_summary())

# case b)
y = torch.matmul(b.T, x)
print(torch.cuda.memory_summary())
y.mean().backward()
print(torch.cuda.memory_summary())
z = torch.matmul(x, c)
z.mean().backward()
print(torch.cuda.memory_summary())
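As a more compact alternative to reading the full summaries, the peak allocation of each case could also be compared directly (a sketch; requires a CUDA device):

import torch

def peak_bytes(case):
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(1024, 1024, device='cuda', requires_grad=True)
    b = torch.randn(1024, 1, device='cuda', requires_grad=True)
    y = torch.matmul(x.T, b) if case == 'a' else torch.matmul(b.T, x)
    y.mean().backward()
    return torch.cuda.max_memory_allocated()

print('case a)', peak_bytes('a'))
print('case b)', peak_bytes('b'))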