Question about activation checkpoint with FSDP

111146 · February 28, 2023, 3:31am

I found that PyTorch’s FSDP has its own wrapping function (apply_activation_checkpointing_wrapper) for activation checkpointing.

github.com/pytorch/workshops/blob/master/FSDP_Workshop/activation_checkpointing/ac_handler.py

import torch
import os
import torch.distributed as dist
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    checkpoint_wrapper,
    CheckpointImpl,
    apply_activation_checkpointing_wrapper,
)

from transformers.models.t5.modeling_t5 import T5Block

from functools import partial

non_reentrant_wrapper = partial(
    checkpoint_wrapper,
    offload_to_cpu=False,
    checkpoint_impl=CheckpointImpl.NO_REENTRANT,
)

check_fn = lambda submodule: isinstance(submodule, T5Block)

This file has been truncated.

I want to know the difference between apply_activation_checkpointing_wrapper and gradient_checkpointing_enable.
When applying activation checkpointing with PyTorch’s FSDP, should I use this function instead of the gradient_checkpointing_enable method provided by HuggingFace models such as GPT2?

agu · March 7, 2023, 10:48pm

I think gradient_checkpointing_enable() is HuggingFace’s own built-in method. It works because HuggingFace models have manual activation checkpointing calls in their source code that can be enabled or disabled.
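To make the "manual calls behind a switch" idea concrete, here is a minimal sketch of that pattern. It is not HuggingFace's actual code: ToyModel, its gradient_checkpointing flag, and the layer layout are hypothetical, and only torch.utils.checkpoint.checkpoint is a real PyTorch call.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class ToyModel(nn.Module):
    """Hypothetical model with a hand-placed, switchable checkpoint call."""

    def __init__(self, dim=16, n_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(n_layers)]
        )
        self.gradient_checkpointing = False  # off by default

    def gradient_checkpointing_enable(self):
        # Flip the switch; forward() decides what to do with it.
        self.gradient_checkpointing = True

    def forward(self, x):
        for layer in self.layers:
            if self.gradient_checkpointing and self.training:
                # Activations inside `layer` are recomputed during backward
                # instead of being stored during forward.
                x = checkpoint(layer, x, use_reentrant=False)
            else:
                x = layer(x)
        return x


torch.manual_seed(0)
toy = ToyModel()
toy.train()
inputs = torch.randn(4, 16)

plain_out = toy(inputs)            # regular forward pass
toy.gradient_checkpointing_enable()
ckpt_out = toy(inputs)             # same math, checkpointed layers
ckpt_out.sum().backward()
```

The point is that the model author decides where the checkpoint boundaries are, and the user only toggles them.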

apply_activation_checkpointing_wrapper() works for general models (not just HuggingFace) because the user passes in the criteria (check_fn) that decide which submodules get checkpointed. If you are using a HuggingFace model, you can try gradient_checkpointing_enable(), since those checkpoint locations have been hand-picked. That said, I am not familiar with how well it works with FSDP.
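For comparison, a rough sketch of the wrapper-based approach on a model that is not from HuggingFace. Block is a hypothetical stand-in for something like T5Block, and FSDP wrapping is left out so the snippet runs in a single CPU process. The import fallback assumes that newer PyTorch releases expose the function as apply_activation_checkpointing while older ones use the apply_activation_checkpointing_wrapper name from the workshop file; check this against your installed version.

```python
from functools import partial

import torch
import torch.nn as nn
from torch.distributed.algorithms._checkpoint import checkpoint_wrapper as cw

# Assumption: the function name depends on the PyTorch version.
apply_ac = getattr(cw, "apply_activation_checkpointing", None) or getattr(
    cw, "apply_activation_checkpointing_wrapper"
)


class Block(nn.Module):
    """Hypothetical residual block standing in for a transformer layer."""

    def __init__(self, dim=16):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.ff(x)


model = nn.Sequential(nn.Linear(16, 16), Block(), Block(), nn.Linear(16, 16))

# Same wrapper choice as in the workshop excerpt, minus the offload argument.
non_reentrant_wrapper = partial(
    cw.checkpoint_wrapper,
    checkpoint_impl=cw.CheckpointImpl.NO_REENTRANT,
)

# The user-supplied criterion: checkpoint every Block, nothing else.
apply_ac(
    model,
    checkpoint_wrapper_fn=non_reentrant_wrapper,
    check_fn=lambda submodule: isinstance(submodule, Block),
)

num_wrapped = sum(isinstance(m, cw.CheckpointWrapper) for m in model.modules())
print("checkpoint-wrapped submodules:", num_wrapped)

# Forward and backward still work; Block activations are recomputed in backward.
out = model(torch.randn(4, 16))
out.sum().backward()
```

Nothing in the model's own source has to know about checkpointing here; the selection lives entirely in check_fn.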