I trained a model with multiple GPUs using model parallelism. When I tested that model with a single GPU, I got the following error:
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:81
Does it mean that if I train a model with multiple GPUs, I have to run that model with the same number of GPUs? Is there any way to avoid that? I am using the recommended approach to save my PyTorch model. http://pytorch.org/docs/notes/serialization.html#recommend-saving-models
No, you don’t have to, but when you re-instantiate the model you need to specify the correct GPU indices in the DataParallel module. If you serialized the whole DataParallel wrapper, you need to take the
.module attribute out and re-wrap it in DataParallel with the devices you actually have.
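For example, a minimal sketch of the unwrap/re-wrap step (the `nn.Linear` model here is a stand-in, not the original poster's architecture):

```python
import torch
import torch.nn as nn

# Stand-in model; substitute your own architecture.
model = nn.Linear(10, 2)

# Suppose the checkpoint held the whole DataParallel wrapper.
wrapped = nn.DataParallel(model)

# Pull the underlying module out of the wrapper...
unwrapped = wrapped.module

# ...and re-wrap it only for the devices that actually exist here,
# e.g. a single GPU, or no wrapper at all on a CPU-only machine.
if torch.cuda.is_available():
    model = nn.DataParallel(unwrapped, device_ids=[0])
else:
    model = unwrapped
```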
Thanks for your reply.
However, I was not using the DataParallel module; I was doing model parallelism instead. I distributed different parts of my model across different GPUs using
torch.nn.Module.cuda and copied tensors between GPUs using
torch.Tensor.cuda. I got that runtime error when I loaded the model on a single GPU.
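For context, that kind of model parallelism looks roughly like this (a toy sketch; the two-layer split and device names are illustrative, not the original code):

```python
import torch
import torch.nn as nn

class TwoDeviceNet(nn.Module):
    """Toy model-parallel net: each half lives on a different device."""
    def __init__(self, dev0="cuda:0", dev1="cuda:1"):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.part1 = nn.Linear(10, 20).to(dev0)
        self.part2 = nn.Linear(20, 2).to(dev1)

    def forward(self, x):
        x = self.part1(x.to(self.dev0))
        # Copy the intermediate tensor to the second device.
        return self.part2(x.to(self.dev1))
```

A checkpoint of such a model records which device each parameter was stored on, which is why loading it on a machine with fewer GPUs can raise "invalid device ordinal" unless the devices are remapped at load time.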
@heilaw can you give a small script to reproduce this?
Also, as we mention in the link you gave, we recommend using load_state_dict to keep things simple.
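A sketch of that approach, combined with map_location so parameters stored on GPUs that don't exist on the loading machine get remapped (the model and file name are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for the real model

# Saving: persist only the parameters, not the module objects.
torch.save(model.state_dict(), "model.pt")

# Loading on a machine with fewer GPUs: map_location remaps every
# stored device (e.g. cuda:1) onto one that exists here.
fresh = nn.Linear(10, 2)
state = torch.load("model.pt", map_location="cpu")
fresh.load_state_dict(state)
fresh.to("cuda:0" if torch.cuda.is_available() else "cpu")
```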
Am I understanding you right: can I run inference on a single GPU with a model trained on multiple GPUs, and vice versa?
Yes, that’s possible.
If you are training a multi-GPU model, you should store
model.module.state_dict() as explained here: that strips the
module. prefix from the parameter names, which would otherwise cause errors when you try to load the checkpoint back into a plain, unwrapped model.
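Concretely, something like this (toy model, not the original code):

```python
import torch
import torch.nn as nn

plain = nn.Linear(10, 2)
parallel = nn.DataParallel(plain)  # parameter keys gain a "module." prefix

# Save the inner module's state_dict so keys match an unwrapped model.
torch.save(parallel.module.state_dict(), "ckpt.pt")

# Loading back into a plain model now works key-for-key.
restored = nn.Linear(10, 2)
restored.load_state_dict(torch.load("ckpt.pt", map_location="cpu"))
```

Saving parallel.state_dict() directly would instead produce keys like "module.weight", which a plain nn.Linear refuses to load.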