Unable to load WaveGlow checkpoint after training with multiple GPUs

I’ve trained the WaveGlow model from here with multiple GPUs, but when I try to load a checkpoint for inference (through inference.py), some checkpoints load without any problem, while most of them raise the error below:

Traceback (most recent call last):
  File "inference.py", line 105, in <module>
    args.sampling_rate, args.is_fp16, args.denoiser_strength)
  File "inference.py", line 46, in main
    model_state_dict = torch.load(waveglow_path, map_location="cuda:0")['model'].state_dict()
  File "/home/anaconda3/envs/dl/lib/python3.6/site-packages/torch/serialization.py", line 387, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "/home/anaconda3/envs/dl/lib/python3.6/site-packages/torch/serialization.py", line 581, in _load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: storage has wrong size: expected 3901634075968565895 got 512

I changed the map_location to “cpu” and “cuda”, and also tried loading the checkpoint with the same number of GPUs used during training, but I still get the same error.

When I train the model with a single GPU, all checkpoints load without any issue. This happens only after I run distributed training.

This usually happens when multiple processes try to write to a single file.
However, that should be prevented by the if rank == 0: guard around the checkpoint save.
Did you remove it or change the save logic somehow?
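For reference, the save step in a multi-GPU (DistributedDataParallel) run is usually guarded like this. This is only a minimal sketch: the save_checkpoint signature and the dictionary keys are assumptions based on the torch.load call in your traceback, and maybe_save_checkpoint is just a hypothetical wrapper name for illustration.

import torch

def save_checkpoint(model, optimizer, learning_rate, iteration, filepath):
    # Save the full model object plus training state; inference.py later
    # reads it back with torch.load(filepath)['model'].state_dict()
    torch.save({'model': model,
                'iteration': iteration,
                'optimizer': optimizer.state_dict(),
                'learning_rate': learning_rate}, filepath)

def maybe_save_checkpoint(rank, model, optimizer, learning_rate, iteration, filepath):
    # Only rank 0 writes the file; if every process writes to the same
    # path at once, the contents get interleaved and torch.load later
    # fails with errors such as "storage has wrong size".
    if rank == 0:
        save_checkpoint(model, optimizer, learning_rate, iteration, filepath)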


Yes, exactly! It was a simple mistake on my part. I had commented out the original “save_checkpoint” section and added a “save_checkpoint” call after the epoch loop without checking if rank == 0. Now it works without any errors.
Thanks a lot for your help!


I was wondering, in such a case, are the checkpoints still salvageable, or are they simply damaged?

If multiple processes have written to the same file, it’s most likely damaged and you won’t be able to restore it.
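If you are unsure which files were affected, one quick way to check is to attempt a CPU-side load of each checkpoint and see which ones deserialize cleanly. This is just a sketch: the checkpoints/waveglow_* pattern is a placeholder for your output directory, and it should be run from the training repo so the pickled model class can be found.

import glob
import pickle
import torch

# Try to load every checkpoint on the CPU and report which files are
# still readable; checkpoints written by several ranks at once typically
# fail here with "storage has wrong size" or an unpickling error.
for path in sorted(glob.glob("checkpoints/waveglow_*")):
    try:
        torch.load(path, map_location="cpu")
        print("OK      " + path)
    except (RuntimeError, EOFError, pickle.UnpicklingError) as e:
        print("BROKEN  " + path + ": " + str(e))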
