DDP with gradient checkpointing: Confusing Documentation

This is the documentation on DDP w/ gradient checkpointing:

DistributedDataParallel currently offers limited support for gradient checkpointing with torch.utils.checkpoint(). If the checkpoint is done with use_reentrant=False (recommended), DDP will work as expected without any limitations. If, however, the checkpoint is done with use_reentrant=True (the default), DDP will work as expected when there are no unused parameters in the model and each layer is checkpointed at most once (make sure you are not passing find_unused_parameters=True to DDP). We currently do not support the case where a layer is checkpointed multiple times, or when there [are] unused parameters in the checkpointed model.

The two sentences that confuse me are:

“We currently do not support the case where a layer is checkpointed multiple times, or when there [are] unused parameters in the checkpointed model.”
“DDP will work as expected without any limitations.”
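
For reference, my understanding of the fully supported case is roughly the pattern below (a minimal sketch; the module names and sizes are placeholders, and I'm assuming the process group has already been initialized):

```python
# Minimal sketch of what I understand to be the fully supported case:
# checkpoint(..., use_reentrant=False) inside a DDP-wrapped model, with
# each block checkpointed exactly once per forward pass.
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import checkpoint

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(128, 128), nn.ReLU())
        self.head = nn.Linear(128, 10)

    def forward(self, x):
        # Activations of self.block are recomputed during backward
        # instead of being stored.
        x = checkpoint(self.block, x, use_reentrant=False)
        return self.head(x)

# Assumes torch.distributed.init_process_group(...) has already been called.
model = DDP(Net())
```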

Does the first sentence apply only to the use_reentrant=True case, or does it mean that if I have to checkpoint the same layer a dynamic number of times, use_reentrant=False will work?
And which PyTorch versions support this? Do I need a 2.4 nightly build?
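
Concretely, the case I'm asking about looks something like this (a toy sketch; the step count, module names, and sizes are made up):

```python
# Toy sketch of the case in question: the same block is checkpointed a
# data-dependent number of times within one forward pass, with
# use_reentrant=False, and the module is then wrapped in DDP.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class IterativeNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(128, 128), nn.ReLU())
        self.head = nn.Linear(128, 1)

    def forward(self, x, n_steps):
        # self.block's parameters go through checkpoint() n_steps times,
        # and n_steps can differ from batch to batch.
        for _ in range(n_steps):
            x = checkpoint(self.block, x, use_reentrant=False)
        return self.head(x)
```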