DDP Training Hangs After Completing an Epoch

I’ve been stuck on this issue for some time and need some guidance.

My network contains an nn.ModuleDict, where each key refers to a city and the value is a classifier for that city. Depending on the data in a batch, different classifiers are called and trained. The model works fine in a single-GPU setting. However, when I switch to multiple GPUs, training hangs at the end of the epoch. My guess is that, because different classifiers are being called on different GPUs depending on the data in each batch (I have confirmed this is happening), the computation graph differs across GPUs and the model cannot synchronize properly. Is my assumption correct? If so, why does it hang only at the end of the epoch, and what can I do to fix this issue?
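
For context, here is a minimal sketch of the kind of model I mean (the module names and sizes are made up, not my actual code):

```python
import torch
import torch.nn as nn

class CityClassifier(nn.Module):
    """One small classifier head per city (hypothetical sizes)."""
    def __init__(self, in_dim=128, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(in_dim, num_classes)

    def forward(self, x):
        return self.fc(x)

class PerCityModel(nn.Module):
    def __init__(self, cities):
        super().__init__()
        self.backbone = nn.Linear(64, 128)  # shared feature extractor
        self.heads = nn.ModuleDict({c: CityClassifier() for c in cities})

    def forward(self, x, city):
        feats = torch.relu(self.backbone(x))
        # Only the classifier for this batch's city is used, so the set of
        # parameters that receives gradients differs from batch to batch
        # (and from rank to rank when running under DDP).
        return self.heads[city](feats)
```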

Thanks

DDP issues allreduce operations to synchronize gradients at the end of each backward pass, which is likely where the hang shows up.

NCCL hangs usually happen for one of the following reasons:

  1. One of the ranks is taking too long to finish, and the other ranks time out waiting for it.
  2. One of the ranks crashed, and the other ranks time out waiting for it.
  3. One of the ranks has a classifier (or something else) that issues an extra collective op, so there is a mismatch in the global order of collective ops across ranks, i.e. a deadlock (see the sketch below).
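
If reason 3 is what is happening here (each rank only runs the classifier for the cities in its own batch, so different parameter sets produce gradients on different ranks), one common mitigation is to construct DDP with `find_unused_parameters=True`. A rough sketch, assuming a `torchrun` launch and a per-city model like the one described in the question:

```python
import os
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: nn.Module) -> DDP:
    """Wrap the per-city model so that ranks using different classifiers
    in a given iteration still agree on the gradient allreduce."""
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    dist.init_process_group(backend="nccl")
    model = model.cuda(local_rank)
    return DDP(
        model,
        device_ids=[local_rank],
        # Allows parameters that did not participate in the forward pass
        # on this rank (the unused city classifiers) to be marked ready,
        # so the bucketed allreduce does not wait on them.
        find_unused_parameters=True,
    )
```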

Besides the good explanations shared by @yf225, you could also rerun your code with `NCCL_DEBUG=INFO` to get additional information about potential issues.
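
For example, one way to enable it is to set the variable in Python before the process group is initialized (exporting it in the launch environment works as well):

```python
import os

# Must be set before NCCL communicators are created, i.e. before
# torch.distributed.init_process_group(backend="nccl") is called.
os.environ["NCCL_DEBUG"] = "INFO"
# Optionally narrow the output to specific NCCL subsystems:
# os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"
```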