Unable to load WaveGlow checkpoint after training with multiple GPUs

I’ve trained the WaveGlow model from here with multiple GPUs, but when I try to load a checkpoint for inference (through inference.py), some checkpoints load without any problem, while most of them raise the error below:

Traceback (most recent call last):
  File "inference.py", line 105, in <module>
    args.sampling_rate, args.is_fp16, args.denoiser_strength)
  File "inference.py", line 46, in main
    model_state_dict = torch.load(waveglow_path, map_location="cuda:0")['model'].state_dict()
  File "/home/anaconda3/envs/dl/lib/python3.6/site-packages/torch/serialization.py", line 387, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "/home/anaconda3/envs/dl/lib/python3.6/site-packages/torch/serialization.py", line 581, in _load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: storage has wrong size: expected 3901634075968565895 got 512

I changed the map_location to “cpu” and “cuda”, and also tried loading the checkpoint with the same number of GPUs used during training, but I still get the same error.

When I train the model with a single GPU, all checkpoints load without any issue. This happens only after I run distributed training.

This usually happens when multiple processes try to write to a single file.
However, that should be prevented by the if rank == 0: guard around the checkpoint save.
Did you remove it or change the save logic somehow?
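For reference, the save step in a multi-GPU (DistributedDataParallel) run is usually guarded like this. This is only a minimal sketch: the save_checkpoint signature and the dictionary keys are assumptions based on the torch.load call in your traceback, and maybe_save_checkpoint is just a hypothetical wrapper name for illustration.

import torch

def save_checkpoint(model, optimizer, learning_rate, iteration, filepath):
    # Save the full model object plus training state; inference.py later
    # reads it back with torch.load(filepath)['model'].state_dict()
    torch.save({'model': model,
                'iteration': iteration,
                'optimizer': optimizer.state_dict(),
                'learning_rate': learning_rate}, filepath)

def maybe_save_checkpoint(rank, model, optimizer, learning_rate, iteration, filepath):
    # Only rank 0 writes the file; if every process writes to the same
    # path at once, the contents get interleaved and torch.load later
    # fails with errors such as "storage has wrong size".
    if rank == 0:
        save_checkpoint(model, optimizer, learning_rate, iteration, filepath)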


Yes, exactly! It was a simple mistake on my part. I had commented out the original “save_checkpoint” section and added a “save_checkpoint” call after the epoch loop without checking if rank == 0. Now it works without any errors.
Thanks a lot for your help!


I was wondering, in such a case, are the checkpoints still salvageable, or are they simply damaged?

If multiple processes have written to the same file, it’s most likely damaged and you won’t be able to restore it.
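If you are unsure which files were affected, one quick way to check is to attempt a CPU-side load of each checkpoint and see which ones deserialize cleanly. This is just a sketch: the checkpoints/waveglow_* pattern is a placeholder for your output directory, and it should be run from the training repo so the pickled model class can be found.

import glob
import pickle
import torch

# Try to load every checkpoint on the CPU and report which files are
# still readable; checkpoints written by several ranks at once typically
# fail here with "storage has wrong size" or an unpickling error.
for path in sorted(glob.glob("checkpoints/waveglow_*")):
    try:
        torch.load(path, map_location="cpu")
        print("OK      " + path)
    except (RuntimeError, EOFError, pickle.UnpicklingError) as e:
        print("BROKEN  " + path + ": " + str(e))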
