Adapt code that uses torch.distributed to run on only one GPU

Hi,
I am running experiments with other people's code. They used 16 GPUs and the torch.distributed library. I just want to run the code on one GPU; I know it will be slow. Is there a simple way to adapt the code to one GPU without having to learn torch.distributed? At the moment my priority is to see whether their code helps us; if we select it, I'll then focus on that library.

  • Will torch.distributed automatically detect that I only have one GPU and work on it? Apparently not, because it is giving me errors:

Use GPU: 0 for training
Traceback (most recent call last):
  File "train.py", line 97, in <module>
    main()
  File "train.py", line 29, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, idx_server, opt))
  File "/home/ericd/anaconda/envs/myPytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/ericd/anaconda/envs/myPytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/ericd/anaconda/envs/myPytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/ericd/tests/CC-FPSE/train.py", line 37, in main_worker
    dist.init_process_group(backend='nccl', init_method=opt.dist_url, world_size=world_size, rank=rank)
  File "/home/ericd/anaconda/envs/myPytorch/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 397, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/ericd/anaconda/envs/myPytorch/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 120, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon)
TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
    1. torch.distributed.TCPStore(arg0: str, arg1: int, arg2: int, arg3: bool)


Line 37 of train.py is:

torch.distributed.init_process_group(backend='nccl', init_method=opt.dist_url, world_size=world_size, rank=rank)

  • How can I adapt the code to my situation? Ideally there is some parameter that I can use and make things compatible.

  • Do I need to find certain lines and modify them? I am afraid that may be the case but I don’t know where or which.

I am working with https://github.com/xh-liu/CC-FPSE but I am not sure if this helps.

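It depends on how the code was written. If the model's forward function hard-codes several devices, something like:

def forward(self, input):
    # each stage is pinned to a different, hard-coded GPU
    x1 = self.layer1(input.to("cuda:0"))
    x2 = self.layer2(x1.to("cuda:1"))
    x3 = self.layer3(x2.to("cuda:2"))
    return x3

Then it would certainly fail on your machine, because "cuda:1" and "cuda:2" do not exist when only one GPU is present.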
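If you do find hard-coded devices like that, one possible workaround is to map every requested GPU index onto a device that actually exists. The following is only a sketch: `pick_device` and `ThreeStage` are hypothetical names made up for illustration and are not part of CC-FPSE or PyTorch.

```python
import torch
import torch.nn as nn


def pick_device(index):
    """Map a requested GPU index onto a device that exists on this machine."""
    n_gpus = torch.cuda.device_count()
    if n_gpus == 0:
        return torch.device("cpu")  # no GPU at all: run on the CPU
    # Wrap around, so with one GPU every index becomes cuda:0.
    return torch.device(f"cuda:{index % n_gpus}")


class ThreeStage(nn.Module):
    """Hypothetical three-stage model, standing in for the example above."""

    def __init__(self):
        super().__init__()
        self.devices = [pick_device(i) for i in range(3)]
        self.layer1 = nn.Linear(8, 8).to(self.devices[0])
        self.layer2 = nn.Linear(8, 8).to(self.devices[1])
        self.layer3 = nn.Linear(8, 8).to(self.devices[2])

    def forward(self, input):
        x1 = self.layer1(input.to(self.devices[0]))
        x2 = self.layer2(x1.to(self.devices[1]))
        x3 = self.layer3(x2.to(self.devices[2]))
        return x3


if __name__ == "__main__":
    model = ThreeStage()
    out = model(torch.randn(4, 8))
    print(out.shape, out.device)
```

With a single GPU all three stages end up on cuda:0, so the `.to(...)` calls between stages do not copy anything.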

If there is nothing like that, you can probably get around it by initializing the process group with a single process, something like:

torch.distributed.init_process_group(backend='nccl', init_method=opt.dist_url, world_size=1, rank=0)

You might also need to change opt.dist_url to something like "tcp://localhost:23456".
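Put together, the idea can be tried outside the training script with a small standalone check. This is a sketch under assumptions: `init_single_process` is a hypothetical helper, the port 23456 is arbitrary and must be free, and the fallback to the gloo backend is my addition so the snippet also runs on a machine without CUDA.

```python
import torch
import torch.distributed as dist


def init_single_process(dist_url="tcp://localhost:23456"):
    """Start a process group that contains only this one process."""
    # NCCL needs a CUDA device; fall back to gloo otherwise (assumption,
    # the original code always uses nccl).
    use_nccl = dist.is_nccl_available() and torch.cuda.is_available()
    backend = "nccl" if use_nccl else "gloo"
    dist.init_process_group(
        backend=backend,
        init_method=dist_url,
        world_size=1,  # one process in total
        rank=0,        # and this process is rank 0
    )
    return backend


if __name__ == "__main__":
    backend = init_single_process()
    print(backend, dist.get_rank(), dist.get_world_size())
    dist.destroy_process_group()
```

If this runs cleanly, the same world_size=1, rank=0 values should be what the training script needs when it is launched with a single process.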

It worked, thank you!