About nn.parallel.DistributedDataParallel

I trained my model with 4 GPUs, like:

mymodule = nn.parallel.DistributedDataParallel(mymodule, device_ids=[local_rank])
But I saved my model with:

torch.save(mymodule.state_dict() , '%s/modelG_%d.pth' % (opt.outf, epoch))
When I load it with:
mymodule.load_state_dict(torch.load(f))
I got:
RuntimeError: storage has wrong size: expected -4763383137013773690 got 128

What is wrong? And is there any way to deal with it without re-training?
Thanks.

Have you ensured that only one process is writing the checkpoint? Multiple processes writing to the same file will corrupt it.
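
For example, a minimal sketch that lets only rank 0 write the file, assuming the usual torch.distributed setup and reusing the variable names from your snippets (mymodule, opt.outf, epoch):

import torch
import torch.distributed as dist

if dist.get_rank() == 0:
    # Save the underlying model's weights so the checkpoint has no
    # "module." prefix and can also be loaded into an unwrapped model.
    torch.save(mymodule.module.state_dict(),
               '%s/modelG_%d.pth' % (opt.outf, epoch))
dist.barrier()  # keep the other ranks from racing ahead while rank 0 writes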


I hadn't considered this problem :cold_sweat:.
Looks like I have to modify my code and re-train my model.
Thanks a lot.