Training on Cluster

Hi,

I’m attempting to train my model across multiple nodes of a cluster, on 3 GPUs in total. I run the training script from node 1, which has GPUs 0 and 1, while node 2 has GPU 2. Following the distributed training example for Faster R-CNN with this command,

CUDA_VISIBLE_DEVICES=0,1,2 python -m torch.distributed.launch --nproc_per_node=3 --use_env train.py

I get the following error:

THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=37 error=10 : invalid device ordinal
Traceback (most recent call last):
  File "train.py", line 266, in <module>
    train_fpn()
  File "train.py", line 115, in train_fpn
    init_distributed_mode(args)
  File "/udd/rsundara/Code/head_detection/vision/utils.py", line 321, in init_distributed_mode
    torch.cuda.set_device(args.gpu)
  File "/udd/rsundara/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 281, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (10) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:37

Can someone please help me figure out how to train models across multiple nodes?

Regards,

Since you are training on multiple nodes of a cluster, what does your nvidia-smi output look like on each node? I think you shouldn’t use CUDA_VISIBLE_DEVICES=0,1,2, since not all three GPUs are connected and visible to the same system.

CUDA_VISIBLE_DEVICES only selects among the GPUs that are physically present in a single system; it cannot make GPUs on another node visible. Hope this solves some part of it.
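For illustration, here is a quick sanity check (just a sketch, to be run on each node with and without the variable set) showing that CUDA_VISIBLE_DEVICES only restricts and re-indexes the GPUs of the local machine:

import torch

# With CUDA_VISIBLE_DEVICES=0,1 only two devices are exposed to this process,
# and they are re-indexed as cuda:0 and cuda:1 regardless of their physical IDs.
print("visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))

Setting the variable to 0,1,2 on a node that only has two GPUs is exactly the kind of situation that produces an “invalid device ordinal” error when a process tries to select device 2.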

Could you post a link to the tutorial you are following, so we can help further? Thanks.

I think you shouldn’t use CUDA_VISIBLE_DEVICES=0,1,2, since not all three GPUs are connected and visible to the same system.

Yes, you’re right.

I’m following a training setup similar to the one described here: https://github.com/pytorch/vision/tree/v0.4.0/references/detection

Is there a way I can train across multiple nodes in PyTorch?

See torch.distributed

Hi @iffiX, could you please be more specific? I am already using torch.distributed; please refer to the link in my previous post for how I use it.

I see, then you will have to write your own launch script, since torch.distributed.launch assumes the same configuration on every node (i.e. it launches the same number of processes per node and expects the same number of GPUs per node).
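As a rough sketch of what each process in such a hand-rolled setup ends up doing (names like init_distributed and local_gpu are mine, not from the reference code; if I remember correctly, the reference’s init_distributed_mode does something equivalent via environment variables):

import os
import torch
import torch.distributed as dist

def init_distributed(rank, world_size, local_gpu, master_addr, master_port=29500):
    # rank: global rank of this process (0 .. world_size-1 across all nodes)
    # local_gpu: GPU index on *this* node (0 or 1 on node 1, 0 on node 2)
    os.environ["MASTER_ADDR"] = master_addr       # address of the node hosting rank 0
    os.environ["MASTER_PORT"] = str(master_port)
    torch.cuda.set_device(local_gpu)              # must be a device that exists locally
    dist.init_process_group(backend="nccl", init_method="env://",
                            world_size=world_size, rank=rank)

With world_size=3, node 1 would run ranks 0 and 1 (local GPUs 0 and 1) and node 2 would run rank 2 (local GPU 0).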

So basically you will have to set CUDA_VISIBLE_DEVICES to 0,1 on node 1 and to 0 on node 2, and then think of some way to pass the right rank and world-size information to each of your processes. A possible layout is sketched below.
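If I remember the reference code correctly, init_distributed_mode reads RANK, WORLD_SIZE and LOCAL_RANK from the environment and uses the env:// init method, so a manual launch could look roughly like this (hostname, port and the &-backgrounding are placeholders; please verify against your copy of utils.py):

# On node 1 (GPUs 0 and 1): start two processes, global ranks 0 and 1
MASTER_ADDR=node1 MASTER_PORT=29500 WORLD_SIZE=3 RANK=0 LOCAL_RANK=0 CUDA_VISIBLE_DEVICES=0,1 python train.py &
MASTER_ADDR=node1 MASTER_PORT=29500 WORLD_SIZE=3 RANK=1 LOCAL_RANK=1 CUDA_VISIBLE_DEVICES=0,1 python train.py &

# On node 2 (its single GPU, visible locally as device 0): start global rank 2
MASTER_ADDR=node1 MASTER_PORT=29500 WORLD_SIZE=3 RANK=2 LOCAL_RANK=0 CUDA_VISIBLE_DEVICES=0 python train.py

The key point is that LOCAL_RANK (and hence torch.cuda.set_device) only ever refers to a GPU that actually exists on that node, which is what the invalid device ordinal error was complaining about.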