Error when using DistributedDataParallel on single-GPU machine

Hi, I’m training with torch.distributed on a 4-GPU machine, and I want to run inference with the trained model on another machine that has only one GPU. But when I run the code like this:

python -m torch.distributed.launch --nproc_per_node=1 visualizer_distributed.py

I got this error:

Traceback (most recent call last):
  File "visualizer_distributed.py", line 21, in <module>
    model = torch.nn.parallel.DistributedDataParallel(model)
  File "H:\anaconda3\lib\site-packages\torch\nn\parallel\distributed.py", line 259, in __init__
    self.process_group = _get_default_group()
NameError: name '_get_default_group' is not defined
Traceback (most recent call last):
  File "H:\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "H:\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "H:\anaconda3\lib\site-packages\torch\distributed\launch.py", line 235, in <module>
    main()
  File "H:\anaconda3\lib\site-packages\torch\distributed\launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['H:\\anaconda3\\python.exe', '-u', 'visualizer_distributed.py', '--local_rank=0']' returned non-zero exit status 1.

Here is the relevant snippet of code:

model = ...
model = torch.nn.parallel.DistributedDataParallel(model)
model.load_state_dict(torch.load(model_params))
model.cuda()

When I set device_ids like this:

model = ...
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=0)
torch.cuda.set_device(0)
model.load_state_dict(torch.load(model_params))
model.cuda()

I got:

Traceback (most recent call last):
  File "visualizer_distributed.py", line 21, in <module>
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=0)
  File "H:\anaconda3\lib\site-packages\torch\nn\parallel\distributed.py", line 259, in __init__
    self.process_group = _get_default_group()
NameError: name '_get_default_group' is not defined
Traceback (most recent call last):
  File "H:\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "H:\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "H:\anaconda3\lib\site-packages\torch\distributed\launch.py", line 235, in <module>
    main()
  File "H:\anaconda3\lib\site-packages\torch\distributed\launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['H:\\anaconda3\\python.exe', '-u', 'visualizer_distributed.py', '--local_rank=0']' returned non-zero exit status 1.

Does anyone know why this problem occurs and how to use DistributedDataParallel for inference on a single-GPU machine?

Thanks in advance!

Based on the error message it looks like you are using a Windows machine.
I’m not familiar with Windows, but I thought it doesn’t support distributed applications.
Were you also using Windows on the first machine?

Yes that’s right, I’m using Windows for inference and Ubuntu for training.

In that case I think you cannot use a distributed setup. :confused:
However, since you have a single GPU on your Windows system, you won’t get any benefits anyway. :wink:

So what should I do if I want to run the distributed-trained model for inference on the single-GPU Windows machine?

I was using nn.DataParallel on both machines, and I had to call nn.DataParallel(model) before loading the model. I’m now trying to do the same thing with DistributedDataParallel, but I ran into this problem. :sweat_smile:

I think the easiest way would be to store the state_dict without the nn.DataParallel "module." prefix in its keys (I assume you are stuck there), as described here.
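If you are working with a checkpoint that was already saved from the wrapped model, you can alternatively strip the prefix at load time. A minimal sketch, where the tiny nn.Linear model and the simulated checkpoint are just stand-ins for your real model and saved file:

```python
import torch.nn as nn

# Toy model standing in for the real one (hypothetical)
model = nn.Linear(4, 2)

# Simulate a checkpoint saved from a DataParallel / DistributedDataParallel
# model: its keys carry a leading "module." prefix. With a real file you
# would use torch.load("model_params.pth", map_location="cpu") instead.
wrapped_sd = {"module." + k: v for k, v in model.state_dict().items()}

# Strip the prefix so the plain (unwrapped) model can load it directly
clean_sd = {k[len("module."):] if k.startswith("module.") else k: v
            for k, v in wrapped_sd.items()}

model.load_state_dict(clean_sd)  # no parallel wrapper needed for inference
```

This way the inference script never touches torch.distributed at all, which sidesteps the Windows limitation entirely.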

I see…

I will try to fix it, thank you for the help!