Hey, I have two questions regarding gradient checkpointing.
1. Do Dropout and BatchNorm layers now work with checkpointing?
- In this tutorial it is mentioned that these layers didn't work with checkpointing:
Is this still the case? Some of the PyTorch documentation suggests this may have changed, at least for Dropout. Can someone provide a definitive answer? (A minimal example of the kind of block I mean is at the bottom of this post.)
2. What is the preferred way to use data parallelism with checkpoint_sequential?
Can someone give me a template for both DataParallel and DistributedDataParallel that actually works with multiple GPUs? A rough sketch of what I'm currently trying is below.
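
To make the questions more concrete, here are the patterns I have in mind. Everything below is just a sketch I put together: the layer sizes, the `CheckpointedNet` name, and the training snippet are placeholders, not code from any tutorial.

For question 1, this is the kind of block I mean (I'm assuming `preserve_rng_state` is the mechanism the docs refer to for Dropout, but I'm not sure):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Toy block containing both layer types in question; sizes are made up
block = nn.Sequential(
    nn.Linear(128, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Dropout(p=0.5),
)

x = torch.randn(32, 128, requires_grad=True)

# preserve_rng_state=True (the default) is what I assume Dropout relies on
out = checkpoint(block, x, preserve_rng_state=True)
out.sum().backward()
```

For question 2, this is roughly what I'm trying for DistributedDataParallel, assuming a torchrun-style launch that sets LOCAL_RANK and the other env variables:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import checkpoint_sequential


class CheckpointedNet(nn.Module):
    """nn.Sequential body run through checkpoint_sequential in forward."""

    def __init__(self, segments=2):
        super().__init__()
        # Layer sizes are placeholders
        self.body = nn.Sequential(
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 10),
        )
        self.segments = segments

    def forward(self, x):
        # Re-run each segment during backward instead of storing activations
        return checkpoint_sequential(self.body, self.segments, x)


def main():
    # Assumes torchrun (or an equivalent launcher) has set up the env vars
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = DDP(CheckpointedNet().cuda(local_rank), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    # Dummy batch, just to exercise forward/backward
    x = torch.randn(32, 512, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")

    loss = F.cross_entropy(model(x), y)
    loss.backward()
    opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

For DataParallel I would just swap the DDP wrapper for `model = nn.DataParallel(CheckpointedNet()).cuda()`. Is that actually the intended way to combine them, or does checkpoint_sequential interact badly with the replicate/scatter/gather that DataParallel does?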