Questions about Gradient checkpointing

Hey, I have two questions regarding gradient checkpointing:

1. Do Dropout and BatchNorm layers now work with checkpointing?

I remember these layers being problematic with checkpointing in the past. Is that still the case? Parts of the PyTorch documentation suggest this may have changed, at least for dropout. Can someone provide a definitive answer?
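For context, here is a minimal sketch of what I'm testing (assuming a recent PyTorch where `torch.utils.checkpoint.checkpoint` takes a `use_reentrant` argument): dropout inside a checkpointed block, relying on `preserve_rng_state=True` (the default) to replay the same dropout mask during recomputation.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)

# A block containing dropout, run under activation checkpointing
block = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5), nn.ReLU())
x = torch.randn(2, 8, requires_grad=True)

# preserve_rng_state=True (the default) should make the recomputed
# forward pass reuse the same dropout mask, keeping gradients consistent
out = checkpoint(block, x, use_reentrant=False)
out.sum().backward()
print(x.grad.shape)
```

Does this pattern give correct gradients, or do I still need to move dropout outside the checkpointed segment?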

2. What is the preferred way to use data parallelism with `checkpoint_sequential`?
Can someone give me a template for both `DataParallel` and `DistributedDataParallel` that actually works with multiple GPUs?
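This is roughly the shape I have in mind, a single-process CPU sketch (world size 1, `gloo` backend) just to show the structure; in real use it would be launched with `torchrun` across GPUs. I'm assuming `use_reentrant=False` is the variant that cooperates with DDP:

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import checkpoint_sequential

# Single-process setup purely for illustration; a real run would use
# torchrun and one process per GPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(*[nn.Linear(8, 8) for _ in range(4)])

    def forward(self, x):
        # Checkpoint the sequential body in 2 segments;
        # use_reentrant=False is (I believe) required for DDP compatibility
        return checkpoint_sequential(self.body, 2, x, use_reentrant=False)

model = DDP(Net())
out = model(torch.randn(2, 8))
out.sum().backward()
dist.destroy_process_group()
```

For `DataParallel` I assume one would just wrap the same module with `nn.DataParallel(Net())` instead, but I'm not sure whether checkpointing inside the replicated forward behaves correctly there, which is why I'm asking for a known-good template.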
