Training gets slower when resuming from a checkpoint

I am training a UNet 3D segmentation model with DDP on a multi-node GPU cluster (8 single-GPU nodes). The network has ~60M parameters, and I am training in FP16.

I am seeing training slow down significantly (by a factor of 2-3) when resuming from a checkpoint. When I start a new training run, it takes an average of 35 minutes per epoch, but when I restart training from a previous checkpoint, it takes over 90 minutes per epoch. The only change I make to the training configuration is loading from the prior checkpoint.
I have tested this several times and the behavior is consistent.
Would really appreciate any help with this.
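
For reference, the resume step is essentially just the following (a sketch: the object names, checkpoint keys, and path are placeholders, and the tiny Conv3d stands in for the DDP-wrapped UNet):

```python
import torch
import torch.nn as nn

model = nn.Conv3d(1, 1, 3).cuda()             # placeholder for the DDP-wrapped UNet
optimizer = torch.optim.Adam(model.parameters())
scaler = torch.cuda.amp.GradScaler()          # FP16 loss scaler

checkpoint = torch.load("checkpoint.pt", map_location="cpu")  # placeholder path
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
scaler.load_state_dict(checkpoint["scaler"])
start_epoch = checkpoint["epoch"] + 1
```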

That’s a strange issue. Could you profile a few training iterations (e.g. 10) of your workload with Nsight Systems for both cases and check where the bottleneck is, i.e. which operations are slowing down?
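
A skeleton like the one below could work as a starting point: it annotates the forward, backward, and optimizer steps with NVTX ranges and limits the capture to ~10 iterations after a short warm-up. The tiny Conv3d model, random data, and script name are only stand-ins; keep your real DDP loop and checkpoint loading and add just the NVTX/profiler calls.

```python
# Launch with (assuming the script is called train.py):
#   nsys profile --capture-range=cudaProfilerApi --trace=cuda,nvtx,osrt -o resume_run python train.py
import torch
import torch.nn as nn

device = "cuda"
# Dummy model/data standing in for the real UNet 3D training objects
model = nn.Sequential(
    nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(), nn.Conv3d(8, 1, 3, padding=1)
).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # matching the FP16 setup

WARMUP, PROFILED = 5, 10  # skip warm-up iterations, then capture ~10

for i in range(WARMUP + PROFILED):
    data = torch.randn(2, 1, 64, 64, 64, device=device)
    target = torch.randn(2, 1, 64, 64, 64, device=device)

    if i == WARMUP:
        torch.cuda.profiler.start()  # opens the nsys capture range

    torch.cuda.nvtx.range_push(f"iteration_{i}")

    torch.cuda.nvtx.range_push("forward")
    with torch.cuda.amp.autocast():
        loss = criterion(model(data), target)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    scaler.scale(loss).backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer")
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_pop()  # iteration

torch.cuda.profiler.stop()  # closes the capture range
```

Comparing the timelines of the fresh run vs. the resumed run should show whether the extra time is spent in the kernels themselves, in data loading, or in communication.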