Will PyTorch do kernel fusion automatically?

Consider the following code:

import torch

# Model definition
linear1 = torch.nn.Linear(1024, 1024, bias=False).cuda()
print(torch.cuda.memory_allocated())  # memory = 4194304 (linear1 parameters)
linear2 = torch.nn.Linear(1024, 1, bias=False).cuda()
print(torch.cuda.memory_allocated()) # memory + 4096 (linear2 parameters)

# Input tensor
inputs = torch.tensor([[1.0]*1024]*1024).cuda() # shape = (1024,1024)
print(torch.cuda.memory_allocated()) # memory + 4194304 (input tensor)

# Forward calculation
out = sum(linear2(linear1(inputs))) # shape = (1)
print(torch.cuda.memory_allocated()) # memory + 4194304 (linear1 output) + 512 (out; the caching allocator rounds even a 4-byte tensor up to a 512-byte block)

My question is: where is the GPU memory for linear2's output, which should be 4096 bytes (1024 × 1 floats)?

Theoretically, this could be fused with the subsequent sum operation (because dloss/dout = dloss/dlinear2_out, which means linear2_out is not needed for the backward calculation), but is this what actually happens? If so, are there any docs about it? I can only find kernel fusion in the JIT docs, but I'm not using JIT here.

I don’t think this operation is fused, as you’ve already mentioned that you are not scripting the model.
torch.cuda.memory_allocated() returns the currently allocated memory, not its peak (use torch.cuda.max_memory_allocated() for that, or torch.cuda.memory_summary()).
Based on your description, I would guess that linear2_out has already been freed, since it's not needed for the backward calculation, and will thus not show up in the next memory_allocated() print statement.
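To see this transient allocation, you could compare the peak against the current allocation after the forward pass. A minimal sketch, assuming a CUDA device is available (it reuses the shapes from the question and replaces the Python builtin sum with .sum()):

```python
import torch

if torch.cuda.is_available():
    linear1 = torch.nn.Linear(1024, 1024, bias=False).cuda()
    linear2 = torch.nn.Linear(1024, 1, bias=False).cuda()
    inputs = torch.ones(1024, 1024, device="cuda")

    torch.cuda.reset_peak_memory_stats()
    out = linear2(linear1(inputs)).sum()

    current = torch.cuda.memory_allocated()
    peak = torch.cuda.max_memory_allocated()
    # The peak includes linear2's (already freed) output, so it should
    # exceed the current allocation by at least those 4096 bytes.
    print(peak - current)
```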


Thanks for your reply! Yes, that's more reasonable. Since PyTorch builds the computation graph dynamically, it can free an intermediate tensor as soon as it's no longer needed for the backward calculation.