Hi, I’m training on a machine with 4 GPUs using torch.distributed, and I want to run inference with the trained model on another machine that has only one GPU. But when I run the code like this:
python -m torch.distributed.launch --nproc_per_node=1 visualizer_distributed.py
I got this error:
Traceback (most recent call last):
File "visualizer_distributed.py", line 21, in <module>
model = torch.nn.parallel.DistributedDataParallel(model)
File "H:\anaconda3\lib\site-packages\torch\nn\parallel\distributed.py", line 259, in __init__
self.process_group = _get_default_group()
NameError: name '_get_default_group' is not defined
Traceback (most recent call last):
File "H:\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "H:\anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "H:\anaconda3\lib\site-packages\torch\distributed\launch.py", line 235, in <module>
main()
File "H:\anaconda3\lib\site-packages\torch\distributed\launch.py", line 231, in main
cmd=process.args)
subprocess.CalledProcessError: Command '['H:\\anaconda3\\python.exe', '-u', 'visualizer_distributed.py', '--local_rank=0']' returned non-zero exit status 1.
Here is the relevant code snippet:
model = ...
model = torch.nn.parallel.DistributedDataParallel(model)
model.load_state_dict(torch.load(model_params))
model.cuda()
When I set device_ids like this:
model = ...
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=0)
torch.cuda.set_device(0)
model.load_state_dict(torch.load(model_params))
model.cuda()
I got:
Traceback (most recent call last):
File "visualizer_distributed.py", line 21, in <module>
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=0)
File "H:\anaconda3\lib\site-packages\torch\nn\parallel\distributed.py", line 259, in __init__
self.process_group = _get_default_group()
NameError: name '_get_default_group' is not defined
Traceback (most recent call last):
File "H:\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "H:\anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "H:\anaconda3\lib\site-packages\torch\distributed\launch.py", line 235, in <module>
main()
File "H:\anaconda3\lib\site-packages\torch\distributed\launch.py", line 231, in main
cmd=process.args)
subprocess.CalledProcessError: Command '['H:\\anaconda3\\python.exe', '-u', 'visualizer_distributed.py', '--local_rank=0']' returned non-zero exit status 1.
Does anyone know why the problem occurs and how to use DistributedDataParallel for inference on a single-GPU machine?
Thanks in advance!
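One possible direction, sketched under assumptions rather than confirmed for this setup: inference on a single GPU usually does not need DistributedDataParallel at all. If the checkpoint was saved with `model.state_dict()` on the wrapped model (an assumption), its keys carry a `module.` prefix, and stripping that prefix lets the weights load into the plain, unwrapped model. The `NameError` itself may mean this PyTorch build has no distributed support (`torch.distributed.is_available()` returning False), which is a guess based on the Windows paths in the traceback. The helper name `strip_module_prefix` below is hypothetical.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading `prefix` removed from each key.

    Keys that do not start with the prefix are copied unchanged, so this is
    also safe for checkpoints saved from an unwrapped model.
    """
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Hypothetical usage, with `model` and `model_params` as in the snippet above
# and no DistributedDataParallel wrapper:
#   import torch
#   state = torch.load(model_params, map_location="cuda:0")
#   model.load_state_dict(strip_module_prefix(state))
#   model.cuda().eval()
```

With this approach the script can be started with plain `python visualizer_distributed.py`, without `torch.distributed.launch`.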