I am training a 3D UNet segmentation model using DDP on a multi-node GPU cluster (8 single-GPU nodes). The network has roughly 60M parameters, and I am training in FP16.
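For reference, the setup follows the standard DDP + automatic mixed precision pattern. This is a simplified sketch rather than my exact script; `UNet3D`, `criterion`, and `loader` are placeholders for my real model, loss, and data pipeline:

```python
import os

import torch
import torch.distributed as dist
from torch.cuda.amp import GradScaler, autocast
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # one process per GPU, 8 nodes
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = UNet3D().cuda(local_rank)                # placeholder for the ~60M-param 3D UNet
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = GradScaler()                            # FP16 via torch.cuda.amp

for images, targets in loader:                   # placeholder data loader
    images, targets = images.cuda(local_rank), targets.cuda(local_rank)
    optimizer.zero_grad()
    with autocast():                             # forward pass in mixed precision
        loss = criterion(model(images), targets)
    scaler.scale(loss).backward()                # scaled loss to avoid FP16 underflow
    scaler.step(optimizer)
    scaler.update()
```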
Training slows down significantly (by a factor of 2-3) when I resume from a checkpoint. A fresh run averages 35 minutes per epoch, but a run restarted from a previous checkpoint takes over 90 minutes per epoch. I make no other changes to the training configuration; the only difference is loading the prior checkpoint.
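Saving and resuming follow the usual `state_dict` round trip, again as a simplified sketch; `CKPT_PATH` is a placeholder, and `local_rank`, `model`, `optimizer`, and `scaler` match the setup sketch above:

```python
# Saving: rank 0 writes model, optimizer, and GradScaler state together.
if dist.get_rank() == 0:
    torch.save(
        {
            "model": model.module.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scaler": scaler.state_dict(),
            "epoch": epoch,
        },
        CKPT_PATH,
    )

# Resuming: every rank loads, mapping tensors onto its own GPU.
ckpt = torch.load(CKPT_PATH, map_location=f"cuda:{local_rank}")
model.module.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
scaler.load_state_dict(ckpt["scaler"])
start_epoch = ckpt["epoch"] + 1
```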
I have tested this several times and the behavior is consistent.
I would really appreciate any help with this.