I will use a very simple example to explain my problem. Say I have a 6-layer feedforward model represented as an array [1, 2, 3, 4, 5, 6], and I want to divide it into 3 segments, i.e. [1,2], [3,4], [5,6], for gradient checkpointing. According to the comment at line 516 of torch.utils.checkpoint_sequential, PyTorch will only call torch.utils.checkpoint for the first two segments, [1,2] and [3,4]; the remaining two layers, 5 and 6, run normally, with their outputs saved for backward. With this design, I don't know how to save the output of the last layer, 6, for the backward phase.
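To make the setup concrete, here is a minimal sketch of what I am describing, assuming a recent PyTorch (the `use_reentrant` keyword may not exist in older versions); the layer sizes are arbitrary placeholders:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Hypothetical 6-layer feedforward model standing in for [1, 2, 3, 4, 5, 6].
model = nn.Sequential(*[nn.Linear(8, 8) for _ in range(6)])

x = torch.randn(4, 8, requires_grad=True)

# Split into 3 segments: [1,2], [3,4], [5,6]. Per the comment cited above,
# the last segment is run without checkpointing, so its activations are
# stored for the backward pass like in a normal forward.
out = checkpoint_sequential(model, 3, x, use_reentrant=False)
out.sum().backward()
```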
Say I call torch.utils.checkpoint for the segment [5, 6] too: is it possible to write some extra code to make sure that the output of layer 6 is saved for backward? Or maybe I don't need to worry about this, because PyTorch will always save the output of the last layer to compute the first gradient (the partial derivative of the loss w.r.t. the output of the last layer)?
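Here is my own sketch (not from the PyTorch source) of what I mean by checkpointing all three segments, including [5, 6]; `segment` is a hypothetical helper I wrote for illustration:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(6)])

def segment(start, end):
    # Build a callable that runs layers[start:end] in sequence.
    def run(x):
        for layer in layers[start:end]:
            x = layer(x)
        return x
    return run

x = torch.randn(4, 8, requires_grad=True)
h = checkpoint(segment(0, 2), x, use_reentrant=False)  # layers 1-2
h = checkpoint(segment(2, 4), h, use_reentrant=False)  # layers 3-4
out = checkpoint(segment(4, 6), h, use_reentrant=False)  # layers 5-6 too

# `out` is returned to the caller either way, so it is in scope when I
# call backward; my question is whether that is enough on its own.
out.sum().backward()
```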