I’ve trained the WaveGlow model from here on multiple GPUs, but when I try to load a checkpoint for inference (through inference.py), some checkpoints load without any problem, while most of them raise the error below:
Traceback (most recent call last):
  File "inference.py", line 105, in <module>
    args.sampling_rate, args.is_fp16, args.denoiser_strength)
  File "inference.py", line 46, in main
    model_state_dict = torch.load(waveglow_path, map_location="cuda:0")['model'].state_dict()
  File "/home/anaconda3/envs/dl/lib/python3.6/site-packages/torch/serialization.py", line 387, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "/home/anaconda3/envs/dl/lib/python3.6/site-packages/torch/serialization.py", line 581, in _load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: storage has wrong size: expected 3901634075968565895 got 512
I changed map_location to "cpu" and "cuda", and also tried loading the checkpoint with the same number of GPUs used during training, but I still get the same error.
When I train the model on a single GPU, every checkpoint loads without issue; the problem appears only after distributed training.
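The size mismatch in the error makes me suspect the checkpoint file itself is corrupted, possibly because in distributed training every process writes to the same checkpoint path instead of only one rank saving. A minimal sketch of the rank-0 guard I would expect around the save (hypothetical names and path, deliberately not using torch so it stands alone, and not the repo's actual code):

```python
import os

CKPT = "waveglow_ckpt.bin"  # hypothetical checkpoint path

def save_checkpoint(rank, payload: bytes):
    """Write the checkpoint from rank 0 only.

    If every rank opens and writes the same file concurrently, the
    interleaved partial writes can leave a file that no longer
    deserializes (e.g. a nonsense "expected" storage size on load).
    """
    if rank == 0:
        tmp = CKPT + ".tmp"
        with open(tmp, "wb") as f:
            f.write(payload)
        # Atomic rename: a reader never observes a half-written file.
        os.replace(tmp, CKPT)

# Simulate 4 training workers; only rank 0 actually writes.
for rank in range(4):
    save_checkpoint(rank, b"\x00" * 512)

print(os.path.getsize(CKPT))  # 512 — written exactly once, by rank 0
```

Is a missing guard like this (or something else in the multi-GPU save path) the likely cause here?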