I’m trying to apply gradient checkpointing to some sub-layers of a
Transformer, so that I can train with a larger batch size and higher throughput.
But without knowing the activation size of each
PyTorch sub-layer, I cannot decide which sub-layers to apply gradient checkpointing to.
I could insert
torch.cuda.memory_summary() calls and timing at each line of the forward pass, but that would be a lot of boilerplate.
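To show what I mean, here is a rough sketch of the kind of instrumentation I’m trying to avoid writing by hand for every layer: a forward-hook helper (the function name and the toy model are my own, just for illustration) that records each leaf module’s output activation size.

```python
import torch
import torch.nn as nn

def record_activation_sizes(model):
    """Register forward hooks on every leaf module and record the
    byte size of each module's output activation. Returns the dict
    of sizes (filled in during forward) and the hook handles."""
    sizes = {}
    handles = []
    for name, module in model.named_modules():
        if len(list(module.children())) > 0:
            continue  # skip containers, only instrument leaf modules
        def hook(mod, inputs, output, name=name):
            if torch.is_tensor(output):
                sizes[name] = output.numel() * output.element_size()
        handles.append(module.register_forward_hook(hook))
    return sizes, handles

# Toy model standing in for a Transformer sub-layer stack.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 8))
sizes, handles = record_activation_sizes(model)

with torch.no_grad():
    model(torch.randn(4, 16))  # batch of 4

for h in handles:
    h.remove()

for name, nbytes in sizes.items():
    print(f"layer {name}: {nbytes} bytes")
```

This at least avoids editing the forward path itself, but it still feels like something that should already exist as a table or a profiling tool.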
So, is there a cheatsheet for the activation size of each
PyTorch module? Or a guideline for applying gradient checkpointing?