I’m trying to apply gradient checkpointing to some sub-layers of a
Transformer, so that I can train with a larger batch size and higher throughput.
But without knowing the activation size of each
PyTorch sub-layer, I cannot decide which sub-layers to apply gradient checkpointing to.
I could insert
torch.cuda.memory_summary() calls and timing at each line of the forward pass, but that would be a lot of boilerplate.
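To show what I mean, here is a rough sketch of the kind of instrumentation I’m trying to avoid writing by hand for every layer: a forward-hook helper (the function name and the toy model are my own, just for illustration) that records each leaf module’s output activation size.

```python
import torch
import torch.nn as nn

def record_activation_sizes(model):
    """Register forward hooks on every leaf module and record the
    byte size of each module's output activation. Returns the dict
    of sizes (filled in during forward) and the hook handles."""
    sizes = {}
    handles = []
    for name, module in model.named_modules():
        if len(list(module.children())) > 0:
            continue  # skip containers, only instrument leaf modules
        def hook(mod, inputs, output, name=name):
            if torch.is_tensor(output):
                sizes[name] = output.numel() * output.element_size()
        handles.append(module.register_forward_hook(hook))
    return sizes, handles

# Toy model standing in for a Transformer sub-layer stack.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 8))
sizes, handles = record_activation_sizes(model)

with torch.no_grad():
    model(torch.randn(4, 16))  # batch of 4

for h in handles:
    h.remove()

for name, nbytes in sizes.items():
    print(f"layer {name}: {nbytes} bytes")
```

This at least avoids editing the forward path itself, but it still feels like something that should already exist as a table or a profiling tool.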
So, is there a cheatsheet for the activation size of each
PyTorch module? Or a guideline for applying gradient checkpointing?