I’m trying to apply gradient checkpointing to some sub-layers of a
Transformer, so that I can train with a larger batch size and higher throughput.
But without knowing the activation size of each
PyTorch sub-layer, I cannot decide which sub-layers to apply gradient checkpointing to.
I could insert
torch.cuda.memory_summary() calls and timing at each line of the forward pass, but that would be a lot of boilerplate.
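To show what I mean, here is a rough sketch of the kind of instrumentation I’m trying to avoid writing by hand for every layer: a forward-hook helper (the function name and the toy model are my own, just for illustration) that records each leaf module’s output activation size.

```python
import torch
import torch.nn as nn

def record_activation_sizes(model):
    """Register forward hooks on every leaf module and record the
    byte size of each module's output activation. Returns the dict
    of sizes (filled in during forward) and the hook handles."""
    sizes = {}
    handles = []
    for name, module in model.named_modules():
        if len(list(module.children())) > 0:
            continue  # skip containers, only instrument leaf modules
        def hook(mod, inputs, output, name=name):
            if torch.is_tensor(output):
                sizes[name] = output.numel() * output.element_size()
        handles.append(module.register_forward_hook(hook))
    return sizes, handles

# Toy model standing in for a Transformer sub-layer stack.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 8))
sizes, handles = record_activation_sizes(model)

with torch.no_grad():
    model(torch.randn(4, 16))  # batch of 4

for h in handles:
    h.remove()

for name, nbytes in sizes.items():
    print(f"layer {name}: {nbytes} bytes")
```

This at least avoids editing the forward path itself, but it still feels like something that should already exist as a table or a profiling tool.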
So, is there a cheatsheet for the activation size of each
PyTorch module? Or a guideline for applying gradient checkpointing?