How can I train on multiple GPUs and run inference on a single GPU?

I train a model on two gpus, and then save the model like this:

    net = nn.DataParallel(net)
    ....., save_path)

Then I load the model and run inference on a single GPU:


The test script is like this:

    im = cv2.imread('./cropped.jpg')
    im = cv2.resize(im, (224, 224)).transpose(2, 0, 1)
    im = torch.tensor([im, im, im], dtype = torch.float32)
    model_path = './res/model.pytorch'
    model = torch.load(model_path)

    out = model(im).detach().cpu().numpy()

I got this error message:

  File "", line 25, in test
    out = model(im).detach().cpu().numpy()
  File "/home/zhangzy/.local/lib/python2.7/site-packages/torch/nn/modules/", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhangzy/.local/lib/python2.7/site-packages/torch/nn/parallel/", line 110, in forward
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
  File "/home/zhangzy/.local/lib/python2.7/site-packages/torch/nn/parallel/", line 121, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
  File "/home/zhangzy/.local/lib/python2.7/site-packages/torch/nn/parallel/", line 36, in scatter_kwargs
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
  File "/home/zhangzy/.local/lib/python2.7/site-packages/torch/nn/parallel/", line 29, in scatter
    return scatter_map(inputs)
  File "/home/zhangzy/.local/lib/python2.7/site-packages/torch/nn/parallel/", line 16, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/home/zhangzy/.local/lib/python2.7/site-packages/torch/nn/parallel/", line 14, in scatter_map
    return Scatter.apply(target_gpus, None, dim, obj)
  File "/home/zhangzy/.local/lib/python2.7/site-packages/torch/nn/parallel/", line 73, in forward
    streams = [_get_stream(device) for device in ctx.target_gpus]
  File "/home/zhangzy/.local/lib/python2.7/site-packages/torch/nn/parallel/", line 100, in _get_stream
    if _streams[device] is None:
IndexError: list index out of range

What is wrong with this code then?

I would say that you are calling .cpu() on the output, but you never moved anything to the GPU.

To do so, you have to use:

    model = torch.load(model_path).cuda()

I don't think that is the root cause, because I added .cuda() and the error still occurs.

That error sounds like there are no GPUs available.

Have you noticed that you wrote CUDA_VISIBLE_DEIVCES instead of CUDA_VISIBLE_DEVICES?

If that is just a transcription error, can you try torch.cuda.is_available() and torch.cuda.device_count() to check whether you have GPUs/CUDA available when you run that?
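A quick way to run that check is sketched below. It only reports what PyTorch can see in the current process; the variable names are illustrative. Note that CUDA_VISIBLE_DEVICES has to be set in the environment before Python starts, and it limits the count reported here.

```python
import torch

# Report what PyTorch can see in the current process.
# CUDA_VISIBLE_DEVICES (set before launching Python) limits this count.
cuda_ok = torch.cuda.is_available()
n_gpus = torch.cuda.device_count()

print("CUDA available:", cuda_ok)
print("Visible GPUs:", n_gpus)
```

If the number of visible GPUs is smaller than the number the model was trained on, a pickled DataParallel object can still try to use the missing devices, which would be consistent with an index error inside scatter.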

You should also consider that, as I said, you have to move both the model and the input to the GPU by calling the .cuda() method.
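A minimal sketch of keeping the model and its input on the same device. The nn.Linear layer and the random batch are hypothetical stand-ins for the real network and image tensor, and the device falls back to the CPU when no GPU is visible, so this is an illustration rather than the code from the question.

```python
import torch
import torch.nn as nn

# Pick the GPU when one is visible, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(8, 2).to(device)    # hypothetical stand-in for the real model
model.eval()

batch = torch.randn(3, 8).to(device)  # stand-in input, moved to the same device

with torch.no_grad():
    out = model(batch).cpu()          # bring the result back to host memory

print(out.shape)
```

Using .to(device) instead of a hard-coded .cuda() keeps the same script usable on machines without a GPU.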

I have access to my GPUs. The program works when I run python, but it does not work if I run CUDA_VISIBLE_DEVICES python
The root of the problem seems to be that I trained my model on two GPUs (nn.DataParallel) but run the test on a single GPU.

Hmm, it could be. When you use DataParallel, the original model becomes a submodule called module. Try saving the state_dict instead of the whole model: move the model to the CPU before saving, and load the state dict instead of loading the model object.
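To see the module submodule in practice, here is a small sketch that compares the state dict keys of a plain model with those of its DataParallel wrapper. The nn.Linear layer is a hypothetical stand-in for the real network.

```python
import torch.nn as nn

net = nn.Linear(4, 2)           # hypothetical stand-in for the real network
wrapped = nn.DataParallel(net)  # the original model is now reachable as wrapped.module

plain_keys = list(net.state_dict().keys())
wrapped_keys = list(wrapped.state_dict().keys())

print(plain_keys)    # ['weight', 'bias']
print(wrapped_keys)  # ['module.weight', 'module.bias']
```

The extra "module." prefix is why a state dict saved from the wrapper does not load directly into an unwrapped model.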

Your problem lies within your saving/loading code:

You save your whole model (the ....., save_path) call in your post), where net is an instance of nn.DataParallel.

I would recommend changing a few things:

  1. You should only save the state dict instead of your model. See this post for further details.

  2. You should not save the state dict of your DataParallel instance but your model’s state dict, since DataParallel is simply a wrapper that also holds things like the GPUs used and copies of your model. You can access the model via net.module, just as @JuanFMontesinos mentioned.

Taking these steps into account you would save your model with code like this:

    net = nn.DataParallel(net)
    ....., save_path)

and load your model like this:

    model = YourNetworkClass() # create an instance of your network
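For reference, the two steps can be sketched end to end as follows. This is one possible way to write them, not necessarily the exact code from the post: TinyNet is a hypothetical stand-in for YourNetworkClass, and the temporary save path is only for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn


class TinyNet(nn.Module):  # hypothetical stand-in for YourNetworkClass
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)


save_path = os.path.join(tempfile.mkdtemp(), "model_state.pt")  # illustrative path

# Training side: wrap for multi-GPU use, then save only the inner model's weights.
net = nn.DataParallel(TinyNet())
torch.save(net.module.state_dict(), save_path)

# Inference side: build a plain (unwrapped) model and load the weights into it.
model = TinyNet()
model.load_state_dict(torch.load(save_path, map_location="cpu"))
model.eval()
```

Because the weights come from net.module, the saved keys carry no "module." prefix and the file holds no record of which GPUs were used for training.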

If you want to infer on multiple GPUs or continue training on multiple GPUs you would have to wrap your model again with nn.DataParallel.

Also, a good practice is to move the model to the CPU before saving its state_dict and move it back to the GPU afterwards. This way the state dict is loaded to the CPU instead of directly to the GPU (which happens if you save the state_dict of a GPU model, since you would be saving CUDA tensors). Loading a model on the CPU is better practice since you can also deploy it to machines which aren’t CUDA-capable.
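A sketch of that CPU round trip, assuming a hypothetical nn.Linear stand-in for the real network and an illustrative temporary path. The map_location argument of torch.load is an addition not mentioned above: it remaps tensors at load time, which helps when a file was saved from a GPU model anyway.

```python
import os
import tempfile

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net = nn.Linear(4, 2).to(device)  # hypothetical stand-in for the trained model

save_path = os.path.join(tempfile.mkdtemp(), "cpu_state.pt")  # illustrative path

net.cpu()                                # move the parameters to host memory first
torch.save(net.state_dict(), save_path)  # the file now holds CPU tensors
net.to(device)                           # move back to continue training

# map_location="cpu" also covers files that were saved from a GPU model.
state = torch.load(save_path, map_location="cpu")
devices = {t.device.type for t in state.values()}
print(devices)  # {'cpu'}
```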