I am training a 3D UNet segmentation model using DDP on a multi-node GPU cluster (8 single-GPU nodes). The network has roughly 60M parameters, and I am training in FP16.
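For reference, the setup follows the standard DDP + automatic mixed precision pattern. This is a simplified sketch rather than my exact script; `UNet3D`, `criterion`, and `loader` are placeholders for my real model, loss, and data pipeline:

```python
import os

import torch
import torch.distributed as dist
from torch.cuda.amp import GradScaler, autocast
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # one process per GPU, 8 nodes
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = UNet3D().cuda(local_rank)                # placeholder for the ~60M-param 3D UNet
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = GradScaler()                            # FP16 via torch.cuda.amp

for images, targets in loader:                   # placeholder data loader
    images, targets = images.cuda(local_rank), targets.cuda(local_rank)
    optimizer.zero_grad()
    with autocast():                             # forward pass in mixed precision
        loss = criterion(model(images), targets)
    scaler.scale(loss).backward()                # scaled loss to avoid FP16 underflow
    scaler.step(optimizer)
    scaler.update()
```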
Training slows down significantly (by a factor of 2-3) when I resume from a checkpoint. A fresh run averages 35 minutes per epoch, but a run restarted from a previous checkpoint takes over 90 minutes per epoch. I make no other changes to the training configuration; the only difference is loading the prior checkpoint.
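Saving and resuming follow the usual `state_dict` round trip, again as a simplified sketch; `CKPT_PATH` is a placeholder, and `local_rank`, `model`, `optimizer`, and `scaler` match the setup sketch above:

```python
# Saving: rank 0 writes model, optimizer, and GradScaler state together.
if dist.get_rank() == 0:
    torch.save(
        {
            "model": model.module.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scaler": scaler.state_dict(),
            "epoch": epoch,
        },
        CKPT_PATH,
    )

# Resuming: every rank loads, mapping tensors onto its own GPU.
ckpt = torch.load(CKPT_PATH, map_location=f"cuda:{local_rank}")
model.module.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
scaler.load_state_dict(ckpt["scaler"])
start_epoch = ckpt["epoch"] + 1
```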
I have tested this several times and the behavior is consistent.
I would really appreciate any help with this.