Hey all,
I’m running a couple of models on a multi-GPU system. When I call torch.save() for a model on a GPU other than device 0 while another model is running on device 0, I get the error below. Saving works perfectly for every model running on GPU 0. I’ve read the documentation on serialization semantics and I believe I’m following the recommended practices, and the default pickle settings look fine for this use case. Does anyone have any insight into this problem?
Link to source for torch.save(): http://pytorch.org/docs/_modules/torch/serialization.html#save
  File "/home/adamvest/models.py", line 156, in save_model
    torch.save(self.model.state_dict(), "%s/weights.pth" % self.args.out_folder)
  File "/home/adamvest/lib/python/torch/serialization.py", line 120, in save
    return _save(obj, f, pickle_module, pickle_protocol)
  File "/home/adamvest/lib/python/torch/serialization.py", line 192, in _save
    serialized_storages[key]._write_file(f)
RuntimeError: cuda runtime error (46) : all CUDA-capable devices are busy or unavailable at /b/wheel/pytorch-src/torch/csrc/generic/serialization.cpp:38
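
One workaround I'm considering is copying the weights to host memory before saving, so that torch.save() only ever writes CPU storages. This is only a sketch: it assumes the failure comes from serializing CUDA storages while device 0 is busy, which I haven't confirmed. The helper names state_dict_to_cpu and save_weights_via_cpu are made up for illustration and are not part of PyTorch.

```python
from collections import OrderedDict


def state_dict_to_cpu(state_dict):
    """Return a new ordered dict with every tensor copied to host memory."""
    # Tensor.cpu() returns the tensor itself if it already lives on the CPU.
    return OrderedDict((name, tensor.cpu()) for name, tensor in state_dict.items())


def save_weights_via_cpu(model, path):
    """Hypothetical helper: save a model's weights using CPU tensors only."""
    # Imported here so state_dict_to_cpu has no hard dependency on torch.
    import torch

    torch.save(state_dict_to_cpu(model.state_dict()), path)
```

In save_model I would then call save_weights_via_cpu(self.model, "%s/weights.pth" % self.args.out_folder) instead of calling torch.save() directly. The saved file would hold CPU tensors, so loading it later should not depend on which GPU the model was trained on.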