Consider the following code:
```python
import torch

# Model definition
linear1 = torch.nn.Linear(1024, 1024, bias=False).cuda()
print(torch.cuda.memory_allocated())  # memory = 4194304 (linear1 parameters)
linear2 = torch.nn.Linear(1024, 1, bias=False).cuda()
print(torch.cuda.memory_allocated())  # memory + 4096 (linear2 parameters)

# Input tensor
inputs = torch.tensor([[1.0] * 1024] * 1024).cuda()  # shape = (1024, 1024)
print(torch.cuda.memory_allocated())  # memory + 4194304 (input tensor)

# Forward calculation
out = sum(linear2(linear1(inputs)))  # shape = (1,)
print(torch.cuda.memory_allocated())  # memory + 4194304 (linear1 output) + 512 (out; PyTorch rounds allocations up to 512 bytes, so even a 4-byte tensor takes 512)
```
My question is: where is the GPU memory for the output of linear2, which should be 1024 × 4 = 4096 bytes?
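For what it's worth, here is a step-by-step variant of the one-liner above (my own rewrite, using `torch.Tensor.sum` instead of Python's builtin `sum`, which I assume does not change what autograd keeps alive) that should show whether the intermediate is freed once its Python reference is dropped:

```python
import torch

linear1 = torch.nn.Linear(1024, 1024, bias=False).cuda()
linear2 = torch.nn.Linear(1024, 1, bias=False).cuda()
inputs = torch.ones(1024, 1024, device="cuda")

base = torch.cuda.memory_allocated()
h = linear1(inputs)   # shape (1024, 1024) -> expect +4194304 bytes
print(torch.cuda.memory_allocated() - base)
y = linear2(h)        # shape (1024, 1)    -> expect +4096 bytes
print(torch.cuda.memory_allocated() - base)
out = y.sum()         # scalar             -> expect +512 bytes
del y                 # drop the only Python reference to linear2's output
print(torch.cuda.memory_allocated() - base)  # does the 4096 disappear here?
```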
Theoretically, this allocation could be fused with the subsequent sum operation (because d_out = d_linear2_out, which means linear2_out is not needed for the backward calculation), but is this what actually happens? If so, are there any docs about this? I can only find kernel fusion in the JIT docs, but I'm not using JIT here.
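One way to test the "linear2_out is not needed for backward" part directly, assuming PyTorch >= 1.10 (which provides `torch.autograd.graph.saved_tensors_hooks`), is to record every tensor autograd saves for the backward pass; a minimal sketch on small shapes:

```python
import torch

lin1 = torch.nn.Linear(8, 8, bias=False)
lin2 = torch.nn.Linear(8, 1, bias=False)
x = torch.randn(4, 8)

saved_shapes = []

def pack(t):
    saved_shapes.append(tuple(t.shape))  # record each tensor autograd saves
    return t

with torch.autograd.graph.saved_tensors_hooks(pack, lambda t: t):
    out = lin2(lin1(x)).sum()

# I would expect the inputs and weights of the two matmuls here,
# but not lin2's (4, 1) output, since sum's backward does not need it.
print(saved_shapes)
```

If lin2's output shape never shows up in `saved_shapes`, that would confirm the reasoning above, independently of whether any kernel fusion happens.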