Torch.save behavior

Hey all,

I’m running a couple of models on a multi-GPU system. When I call torch.save() for a model on a GPU other than device 0 while another model is running on device 0, I get the error below; saving works perfectly for models running on GPU 0. I’ve read the documentation on serialization semantics and I seem to be following the recommended practices, and the default pickle settings look fine for this use case as well. Does anyone have any insight into this problem?

Link to source for torch.save(): http://pytorch.org/docs/_modules/torch/serialization.html#save

  File "/home/adamvest/models.py", line 156, in save_model
    torch.save(self.model.state_dict(), "%s/weights.pth" % self.args.out_folder)
  File "/home/adamvest/lib/python/torch/serialization.py", line 120, in save
    return _save(obj, f, pickle_module, pickle_protocol)
  File "/home/adamvest/lib/python/torch/serialization.py", line 192, in _save
    serialized_storages[key]._write_file(f)
RuntimeError: cuda runtime error (46) : all CUDA-capable devices are busy or unavailable at /b/wheel/pytorch-src/torch/csrc/generic/serialization.cpp:38

Have you tried switching the current device with:

with torch.cuda.device(1):
    torch.save(...)

Yes, I was able to work around the issue either with this or by moving the model to the CPU before saving. Still not sure of the root cause, though.
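For anyone else hitting this, a minimal sketch of the CPU-saving workaround mentioned above. The model here is a hypothetical stand-in for `self.model` in the traceback; the idea is to copy the state dict tensors to CPU first, so torch.save() never has to read CUDA storage:

```python
import torch
import torch.nn as nn

# Hypothetical model standing in for self.model from the traceback.
model = nn.Linear(4, 2)

# Copy every parameter/buffer to CPU before serializing, so the save
# path never touches CUDA storage on a busy device.
cpu_state = {k: v.cpu() for k, v in model.state_dict().items()}
torch.save(cpu_state, "weights.pth")

# Loading later is device-agnostic; map_location keeps it on CPU.
state = torch.load("weights.pth", map_location="cpu")
model.load_state_dict(state)
```

Loading with `map_location` also means the checkpoint can be restored on a machine with a different GPU layout (or no GPU at all).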