How to use nn.parallel.DistributedDataParallel

I want to use nn.parallel.DistributedDataParallel to train my model on a single machine with 8 2080 Ti GPUs. I set up the distributed config as torch.distributed.init_process_group(backend='nccl', init_method='tcp://localhost:1088', rank=0, world_size=1).
However, no GPU does any work during training. If I use nn.DataParallel instead, the problem does not occur.

DistributedDataParallel (DDP) is multi-process training. For your case, you would get the best performance with 8 DDP processes, where the i-th process calls:

torch.distributed.init_process_group(
    backend='nccl',
    init_method='tcp://localhost:1088',
    rank=i, 
    world_size=8
)

Or, you could set the relevant environment variables and use https://github.com/pytorch/pytorch/blob/master/torch/distributed/launch.py
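For a quick sanity check of this pattern without GPUs, here is a minimal sketch that spawns one process per rank. The port 1088 comes from your snippet; the gloo backend and world_size=2 are substitutions so the sketch runs on a CPU-only machine. For your 8-GPU case you would use backend='nccl', nprocs=8, and pin each process to one GPU with torch.cuda.set_device(rank):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # Each process makes its own init_process_group call with a unique rank.
    # "gloo" lets this sketch run on CPU; with 8 GPUs you would use "nccl"
    # and call torch.cuda.set_device(rank) before constructing the model.
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://localhost:1088",
        rank=rank,
        world_size=world_size,
    )
    model = DDP(nn.Linear(4, 2))   # gradients are all-reduced across processes
    out = model(torch.ones(1, 4))
    out.sum().backward()
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2, join=True)
```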

Thanks for your reply. Now I can run my training program and the GPUs are working, but nothing is printed in the terminal. Why?

Can you share a code snippet? What do you expect to see in the terminal?

Hello,

I am in a very similar situation where I have a single node and 8 GPUs. I used the following resource as a guideline for distributed parallel training: https://github.com/dnddnjs/pytorch-multigpu

I was able to run this example fine, but when I try to load the model, I get the following error:
Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 4 does not equal 0 (while checking arguments for cudnn_convolution)

Could this be a problem in how I am loading the training data?

Hi, it means that your input and your model weights are not on the same device, e.g. your input is on GPU 0 while your model weights are on GPU 1. Both the input and the weights must live on the same device. This may be caused by the way the model is loaded. Can you show your loading code or give an example?
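As a sketch of the fix pattern (device names here are placeholders; with DDP each process would choose its own device from its rank instead of hardcoding cuda:0):

```python
import torch
import torch.nn as nn

# Pick one device and put BOTH the model and every batch on it.
device = torch.device("cuda", 0) if torch.cuda.is_available() else torch.device("cpu")

model = nn.Conv2d(3, 8, kernel_size=3).to(device)   # move the model once
x = torch.randn(1, 3, 32, 32).to(device)            # move each input batch too

y = model(x)
# input and weight now agree, so cudnn_convolution no longer complains
assert y.device == next(model.parameters()).device
print(y.shape)  # torch.Size([1, 8, 30, 30])
```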

Thanks for the reply. I believe I found the error. The original code was written for a single GPU and unfortunately there were multiple places where cuda:0 was hardcoded.

However, I do have a follow-up question. In all the examples I have seen so far, the dataloader is associated with a DistributedSampler: https://pytorch.org/docs/stable/_modules/torch/utils/data/distributed.html

In many cases the dataloader loads files from a directory. When using DDP, the DistributedSampler allocates files to each GPU so that each GPU gets a unique subset of samples from the total dataset (I am assuming the total number of data items is divisible by the number of GPUs).

In my case I am loading 3D data and then taking patches of this data, which serve as the training inputs of the network. So one data file corresponds to more than one training input.

What I currently do is load the data from the files in a folder. Then I use a custom sampler that loads the patch indices and has an iterator that yields patches when called. I feed this sampler to a DataLoader, along with the whole dataset and the batch size. This works fine for one GPU.

I am now converting my code to DDP. I could put the DistributedSampler after my custom sampler, but I worry about multiple GPUs accessing the same file (again, the input is a patch, and different patches can come from the same file). Am I correct that this would be a problem?

Another approach could be to put the DistributedSampler before my current sampler. But I am a bit unsure how to hook up this DistributedSampler to my existing code.

I suppose yet another method would be to bypass torch.utils.data.distributed.DistributedSampler and instead have my initial Dataset's getitem distribute the files among the GPUs in a manner similar to the DistributedSampler, keeping the rest of my hooks the same. Alternatively, in the main loop I could handle the distribution of files myself and pass it into each spawned process.

Would one approach be better than the others? Or should I be using another approach altogether? Does DDP work properly if the code does not use a DistributedSampler?
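For what it's worth, one common pattern is to flatten (file, patch) pairs into a single index space, so DistributedSampler can shard patch indices directly; several processes reading the same file is generally harmless as long as access is read-only. A sketch, where PatchDataset and its sizes are hypothetical:

```python
import torch
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler

class PatchDataset(Dataset):
    """Hypothetical dataset: flattens (file, patch) pairs into one index
    space, so DistributedSampler shards patches instead of files."""
    def __init__(self, num_files=4, patches_per_file=6):
        self.index = [(f, p) for f in range(num_files)
                             for p in range(patches_per_file)]
    def __len__(self):
        return len(self.index)
    def __getitem__(self, i):
        file_id, patch_id = self.index[i]
        # In real code: lazily load file `file_id` and crop patch `patch_id`.
        return torch.full((2, 2), float(file_id * 100 + patch_id))

ds = PatchDataset()
# Hooked up to a DataLoader exactly as with any sampler (here rank 0 of 3):
loader = DataLoader(ds, batch_size=4,
                    sampler=DistributedSampler(ds, num_replicas=3, rank=0,
                                               shuffle=False))

# num_replicas/rank can be passed explicitly, so we can inspect the shards
# without initializing a process group.
shards = [set(DistributedSampler(ds, num_replicas=3, rank=r, shuffle=False))
          for r in range(3)]

# Each rank sees a disjoint subset; together they cover every patch index.
assert set.union(*shards) == set(range(len(ds)))
assert all(shards[a].isdisjoint(shards[b])
           for a in range(3) for b in range(a + 1, 3))
```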

@mrshenli

  1. Can you please define the terms: 1) node, 2) process in the context you are using them?
  2. If I want to train my model on 4 GPUs, do you call it 4 processes? or 1 process?
  3. Does init_method correspond to the address of my PC or to the GPU I’m accessing on a cluster?
  4. In this tutorial, what were you referring to as machine?

@mingyang94

  1. Can you please explain how you arrived at:

Hey @spabho

  1. Can you please define the terms: 1) node, 2) process in the context you are using them?

We usually use node/machine/server to refer to one physical computer, which can be equipped with multiple GPUs.

A process here is an operating-system process, as in process vs. thread.

If I want to train my model on 4 GPUs, do you call it 4 processes? or 1 process?

In this case, using 4 processes with DDP should give you the best performance.

Does init_method correspond to the address of my PC or to the GPU I’m accessing on a cluster?

It corresponds to the address of your PC. It gives the 4 DDP processes the information they need to perform rendezvous.

In this tutorial, what were you referring to as machine?

Machine always refers to a node/server/computer.

To be clear, there are three concepts involved in DDP training:

  • Node/Machine/Server: a physical computer that can contain multiple GPUs. It can also talk to other nodes through the network.
  • GPU: a single GPU device.
  • Process: each process should run its own DDP instance. Usually each DDP instance exclusively operates on one GPU (if your model fits in one GPU), and DDP instances talk to each other over the network.
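On a single node, the usual mapping is one process per GPU; a tiny helper (hypothetical, not a PyTorch API) makes the rank-to-device mapping explicit:

```python
import torch

def device_for_rank(rank: int, num_gpus: int) -> str:
    """Hypothetical helper: pin the i-th DDP process to the i-th GPU."""
    return f"cuda:{rank % num_gpus}"

# In each spawned process you would then do something like:
#   torch.cuda.set_device(device_for_rank(rank, torch.cuda.device_count()))
#   model = model.to(device_for_rank(rank, torch.cuda.device_count()))
print(device_for_rank(3, 8))  # cuda:3
```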

This example might serve better as a starting point for DDP.