DDP model hangs on torch.save()

When I save my DDP model with torch.save, it hangs indefinitely. After experimenting a bit, I noticed that accessing any value in model.state_dict() also hangs.

For example, state_dict.keys() works fine, but state_dict.values() hangs. On further inspection, I can assign one of the values, e.g. val = state_dict[key1], and print properties like val.device. However, printing val itself hangs indefinitely.
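For anyone trying to narrow this down, the checks above can be sketched roughly as follows. The helper names describe_state_dict and copy_to_cpu are made up for illustration, and the nn.Linear model is only a stand-in for the DDP-wrapped model. One possible reading, which nothing in this thread confirms, is that metadata such as val.device is available without touching the data, while printing a CUDA tensor has to copy its contents to the host and so waits for any GPU work still queued ahead of it.

```python
import torch
import torch.nn as nn


def describe_state_dict(state_dict):
    """Collect tensor metadata without reading tensor contents."""
    rows = []
    for key, val in state_dict.items():
        # .device, .dtype and .shape are metadata; reading them does not
        # copy the tensor's data to the host.
        rows.append((key, str(val.device), str(val.dtype), tuple(val.shape)))
    return rows


def copy_to_cpu(state_dict):
    """Copy tensors to the CPU one at a time, reporting progress."""
    cpu_state = {}
    for key, val in state_dict.items():
        print(f"copying {key} ...", flush=True)
        # For a CUDA tensor this copy waits for previously queued GPU work,
        # so the last key printed marks where a hang shows up.
        cpu_state[key] = val.detach().cpu()
    return cpu_state


# Stand-in model (assumption); in practice this would be the DDP-wrapped model.
model = nn.Linear(4, 2)
state = model.state_dict()

for row in describe_state_dict(state):
    print(row)

cpu_state = copy_to_cpu(state)
```

If the metadata loop finishes but the copy loop stalls on the first key, that would point at pending GPU work rather than at torch.save itself.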

Essentially, any access to the tensor data itself hangs. This seems like very bizarre behavior, and I’m not sure how to proceed. Would love ideas!

Hi asivap,

Did you ever figure this out? I’m having the same issue.