Will PyTorch do kernel fusion automatically?

Consider the following code:

import torch

# Model definition
linear1 = torch.nn.Linear(1024, 1024, bias=False).cuda()
print(torch.cuda.memory_allocated())  # memory = 4194304 (linear1 parameters)
linear2 = torch.nn.Linear(1024, 1, bias=False).cuda()
print(torch.cuda.memory_allocated()) # memory + 4096 (linear2 parameters)

# Input tensor
inputs = torch.tensor([[1.0]*1024]*1024).cuda() # shape = (1024,1024)
print(torch.cuda.memory_allocated()) # memory + 4194304 (input tensor)

# Forward calculation
out = sum(linear2(linear1(inputs))) # shape = (1)
print(torch.cuda.memory_allocated()) # memory + 4194304 (linear1 output) + 512 (out; the caching allocator rounds even a 4-byte tensor up to a 512-byte block)

My question is: where is the GPU memory for linear2's output, which should be 4096 bytes (1024 × 1 floats)?

Theoretically, this could be fused with the subsequent sum operation (because dloss/dout = dloss/dlinear2_out, which means linear2_out is not needed for the backward calculation), but is this what actually happens? If so, are there any docs about it? I can only find kernel fusion in the JIT docs, but I'm not using JIT here.

I don’t think this operation is fused, as you’ve already mentioned that you are not scripting the model.
torch.cuda.memory_allocated() returns the currently allocated memory, not its peak (use torch.cuda.max_memory_allocated() for that, or torch.cuda.memory_summary()).
Based on your description, I would guess that linear2_out has already been freed, since it's not needed for the backward calculation, and will thus not show up in the next memory_allocated() print statement.
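To see this transient allocation, you could compare the peak against the current allocation after the forward pass. A minimal sketch, assuming a CUDA device is available (it reuses the shapes from the question and replaces the Python builtin sum with .sum()):

```python
import torch

if torch.cuda.is_available():
    linear1 = torch.nn.Linear(1024, 1024, bias=False).cuda()
    linear2 = torch.nn.Linear(1024, 1, bias=False).cuda()
    inputs = torch.ones(1024, 1024, device="cuda")

    torch.cuda.reset_peak_memory_stats()
    out = linear2(linear1(inputs)).sum()

    current = torch.cuda.memory_allocated()
    peak = torch.cuda.max_memory_allocated()
    # The peak includes linear2's (already freed) output, so it should
    # exceed the current allocation by at least those 4096 bytes.
    print(peak - current)
```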


Thanks for your reply! Yes, that's more reasonable. Since PyTorch builds the computation graph dynamically, it can free an intermediate tensor as soon as it's no longer needed for the backward calculation.