DDP model hangs on torch.save()

When I save my DDP model with torch.save, it hangs indefinitely. After some experimenting, I noticed that accessing any value in model.state_dict() also hangs.

For example, state_dict.keys() works fine, but state_dict.values() hangs. On further inspection, I can assign one of the values, e.g. val = state_dict[key1], and print properties like val.device. However, printing val itself hangs indefinitely.
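
The checks described above can be sketched roughly as follows. This is only an illustration, not the original poster's code: the helper name `probe_state_dict` is hypothetical, and a small `nn.Linear` on CPU stands in for the DDP-wrapped model. Metadata such as `device` and `shape` lives on the host, whereas printing the values of a CUDA tensor requires copying data back to the host, which waits for pending work on the device.

```python
import torch
import torch.nn as nn


def probe_state_dict(model: nn.Module) -> dict:
    """Collect host-side metadata per state_dict entry without reading tensor values."""
    state_dict = model.state_dict()
    report = {}
    for key in state_dict.keys():   # listing keys does not touch tensor data
        val = state_dict[key]       # fetching the tensor object does not either
        report[key] = {
            "device": str(val.device),
            "shape": tuple(val.shape),
            "dtype": str(val.dtype),
        }
    return report


model = nn.Linear(4, 2)  # stand-in for the real model (assumption)
report = probe_state_dict(model)
for key, meta in report.items():
    print(key, meta)

# Reading values is the step that blocked in the post. On a CUDA tensor this
# copies data to the host and therefore waits for queued GPU work to finish.
first_key = next(iter(report))
values = model.state_dict()[first_key].cpu()
```

If the metadata loop finishes but the final `.cpu()` call blocks on a real setup, that would be consistent with the behavior described in the post.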

Essentially, accessing this memory hangs. This is very bizarre behavior and I’m not sure how to proceed. Would love ideas!

Hi asivap,

Did you ever figure this out? I’m having the same issue.

Are you loading the model in a DDP setup as well?

I’m having the same issue. The funny thing is, every checkpoint saves successfully except the one from the last epoch.
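
One possible explanation, not confirmed anywhere in this thread, is that the ranks fall out of step around the final save, for instance when some ranks leave the training loop while another is still writing. A minimal sketch of a rank-0 save guarded by barriers is below. The helper name `save_checkpoint` and the file name are hypothetical, and the snippet falls back to a plain single-process save when no process group is initialized.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn


def save_checkpoint(model: nn.Module, path: str) -> None:
    """Save on rank 0 only, keeping all ranks in step around the save."""
    distributed = dist.is_available() and dist.is_initialized()
    if distributed:
        dist.barrier()  # every rank reaches this point before rank 0 writes
    rank = dist.get_rank() if distributed else 0
    if rank == 0:
        # Unwrap DistributedDataParallel so keys carry no "module." prefix.
        to_save = model.module if hasattr(model, "module") else model
        # Copy tensors to CPU first; a stuck GPU would block here, not in torch.save.
        cpu_state = {k: v.detach().cpu() for k, v in to_save.state_dict().items()}
        torch.save(cpu_state, path)
    if distributed:
        dist.barrier()  # no rank moves on (or exits) until the file is written


model = nn.Linear(4, 2)  # stand-in for the real model (assumption)
ckpt_path = os.path.join(tempfile.mkdtemp(), "last_epoch.pt")
save_checkpoint(model, ckpt_path)
restored = torch.load(ckpt_path)
```

In a real DDP job every rank would call `save_checkpoint` at the same point in the loop, including after the last epoch, so that the barriers match up.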