Hi,
I am running experiments on other people's code. They used 16 GPUs and the torch.distributed library. I just want to run the code on one GPU; I know it will be slow. Is there a simple way to adapt the code to one GPU without having to learn torch.distributed? At the moment my priority is to see whether their code helps us; if it does, then I'll focus on that library.
- Will torch.distributed automatically detect that I only have one GPU and work with it? No, because it gives me errors:
Use GPU: 0 for training
Traceback (most recent call last):
  File "train.py", line 97, in <module>
    main()
  File "train.py", line 29, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, idx_server, opt))
  File "/home/ericd/anaconda/envs/myPytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/ericd/anaconda/envs/myPytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/ericd/anaconda/envs/myPytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/ericd/tests/CC-FPSE/train.py", line 37, in main_worker
    dist.init_process_group(backend='nccl', init_method=opt.dist_url, world_size=world_size, rank=rank)
  File "/home/ericd/anaconda/envs/myPytorch/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 397, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/ericd/anaconda/envs/myPytorch/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 120, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon)
TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
    1. torch.distributed.TCPStore(arg0: str, arg1: int, arg2: int, arg3: bool)
Line 37 is:
torch.distributed.init_process_group(backend='nccl', init_method=opt.dist_url, world_size=world_size, rank=rank)
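As far as I can tell, this particular TypeError usually means TCPStore received None for its port argument: TCPStore expects (host: str, port: int, world_size: int, start_daemon: bool), and torch's TCP rendezvous gets host and port by parsing the init_method URL with urlparse. So it is worth checking whether opt.dist_url carries an explicit port. A quick check of the parsing, using a hypothetical helper and example URLs (29500 is just an example port):

```python
from urllib.parse import urlparse

def check_dist_url(url):
    """Report the (hostname, port) that TCPStore would receive from a
    tcp:// init_method URL. A URL with no port parses to port=None,
    which triggers the 'incompatible constructor arguments' TypeError
    seen in the traceback above."""
    parsed = urlparse(url)
    return parsed.hostname, parsed.port

print(check_dist_url('tcp://127.0.0.1'))        # port is None -> TypeError later
print(check_dist_url('tcp://127.0.0.1:29500'))  # port is 29500 -> OK
```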
-
How can I adapt the code to my situation? Ideally there is some parameter I can set to make things compatible.
-
Do I need to find certain lines and modify them? I am afraid that may be the case, but I don't know where or which ones.
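In case it helps others answering: torch.distributed does run with a single process, so one low-effort adaptation is to launch with ngpus_per_node = 1 and initialize a one-process group (world_size=1, rank=0) whose init_method URL has an explicit port. With world_size=1, collectives like all_reduce and barrier act over a single rank, so the rest of the training code can stay unchanged. A minimal sketch, assuming you can pass these values in (I use the gloo backend so it runs even without CUDA; the repo uses nccl, and 29500 is just an example port):

```python
import torch.distributed as dist

def init_single_process(port=29500):
    """Set up a one-process 'distributed' group.

    world_size=1 and rank=0 make every collective operate over a single
    rank. 'gloo' works on CPU; swap in 'nccl' on a CUDA machine.
    """
    dist.init_process_group(
        backend='gloo',
        init_method=f'tcp://127.0.0.1:{port}',  # explicit port avoids the TCPStore error
        world_size=1,
        rank=0,
    )

init_single_process()
print(dist.get_world_size(), dist.get_rank())
dist.destroy_process_group()
```

With this, mp.spawn can still be used with nprocs=1, or main_worker can be called directly with rank 0.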
I am working with https://github.com/xh-liu/CC-FPSE (code for the NeurIPS 2019 paper "Learning to Predict Layout-to-image Conditional Convolutions for Semantic Image Synthesis"), but I am not sure if this helps.