I’ve trained the WaveGlow model from here with multiple GPUs, but when I try to load a checkpoint for inference (through inference.py), some checkpoints load without any problem, while most of them raise the error below:
```
Traceback (most recent call last):
  File "inference.py", line 105, in <module>
    args.sampling_rate, args.is_fp16, args.denoiser_strength)
  File "inference.py", line 46, in main
    model_state_dict = torch.load(waveglow_path, map_location="cuda:0")['model'].state_dict()
  File "/home/anaconda3/envs/dl/lib/python3.6/site-packages/torch/serialization.py", line 387, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "/home/anaconda3/envs/dl/lib/python3.6/site-packages/torch/serialization.py", line 581, in _load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: storage has wrong size: expected 3901634075968565895 got 512
```
I changed map_location to “cpu” and “cuda”, and I also tried loading the checkpoint with the same number of GPUs used during training, but I still get the same error.
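Concretely, the load attempts look like this (the checkpoint path below is a placeholder for my actual file):

```python
import torch

waveglow_path = "checkpoints/waveglow_10000"  # placeholder for an affected checkpoint

# Each of these raises the same RuntimeError on the affected checkpoints:
checkpoint = torch.load(waveglow_path, map_location="cuda:0")
checkpoint = torch.load(waveglow_path, map_location="cuda")
checkpoint = torch.load(waveglow_path, map_location="cpu")
```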
When I train the model with a single GPU, every checkpoint loads without any issue; the error happens only with checkpoints saved during distributed training.
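To show the pattern, here is a minimal scan over the checkpoint directory (the directory name and glob pattern are placeholders for my setup); it also prints the on-disk size, since the size mismatch in the error message suggests the failing files may be truncated or corrupted:

```python
import glob
import os

import torch

# Placeholder path; in my setup this is the output directory of distributed training.
for path in sorted(glob.glob("checkpoints/waveglow_*")):
    size = os.path.getsize(path)
    try:
        # Loading on CPU is enough to trigger the deserialization error.
        torch.load(path, map_location="cpu")
        print(f"{path} ({size} bytes): ok")
    except RuntimeError as err:
        print(f"{path} ({size} bytes): {err}")
```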