A question about the DDP tutorial

I read the tutorial and ran demo_basic with 2 visible GPUs. Inside the function I printed the devices of labels and outputs, and both are on GPUs, while the input torch.randn(20, 10) is on the CPU. The tutorial is also a little confusing in that the backend is gloo, yet the main function uses GPUs.
My question is: why don’t we need to transfer the inputs to the GPUs?
Also, the tutorial might be more practical if dummy datasets and dataloaders are provided.
def demo_basic(rank, world_size):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)

    # create model and move it to GPU with id rank
    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(rank)
    loss_fn(outputs, labels).backward()
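
On the suggestion about dummy datasets and dataloaders, here is a minimal sketch of what that could look like. It is not part of the tutorial: RandomRegressionDataset and make_loader are hypothetical names, and the shapes (10 input features, 5 output features, batches of 20) are only chosen to match the tensors in demo_basic above. DistributedSampler is given num_replicas and rank explicitly, so the sketch does not need an initialized process group.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler


class RandomRegressionDataset(Dataset):
    """Hypothetical dummy dataset: random inputs and random regression targets."""

    def __init__(self, num_samples=80, in_features=10, out_features=5, seed=0):
        # A seeded generator makes every rank build the same data.
        gen = torch.Generator().manual_seed(seed)
        self.inputs = torch.randn(num_samples, in_features, generator=gen)
        self.labels = torch.randn(num_samples, out_features, generator=gen)

    def __len__(self):
        return self.inputs.shape[0]

    def __getitem__(self, idx):
        return self.inputs[idx], self.labels[idx]


def make_loader(rank, world_size, batch_size=20):
    """Build a DataLoader that yields only this rank's shard of the dataset."""
    dataset = RandomRegressionDataset()
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=True
    )
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)


if __name__ == "__main__":
    # Each of 2 ranks sees 40 of the 80 samples, i.e. 2 batches of 20.
    for rank in range(2):
        loader = make_loader(rank, world_size=2)
        for inputs, labels in loader:
            print(rank, tuple(inputs.shape), tuple(labels.shape))
```

Inside demo_basic, each batch would then be moved to the process's GPU before the forward pass. When training for several epochs with shuffle=True, calling sampler.set_epoch(epoch) at the start of each epoch changes the shuffling order between epochs.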


Thanks for pointing this out. Yep, it’s good practice to also move the inputs to the destination GPU. The tutorial code didn’t fail because DDP recursively detects tensors in the inputs and moves them to the target device. See the code below:
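
As a minimal sketch of that good practice (this is not the DDP-internal code the reply refers to), a training step can create or move both inputs and labels on the process's device before calling the DDP-wrapped model. The function name train_step, the nn.Linear stand-in for ToyModel, the port number, and the CPU fallback are assumptions added for illustration, so the sketch can also run on a machine without GPUs.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP


def train_step(rank: int, world_size: int) -> float:
    """Run one forward/backward/step with inputs placed explicitly on the device."""
    # Assumed rendezvous settings; any free port on localhost would do.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12355")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    use_cuda = torch.cuda.is_available()
    device = torch.device(f"cuda:{rank}" if use_cuda else "cpu")

    # nn.Linear(10, 5) stands in for the tutorial's ToyModel.
    model = nn.Linear(10, 5).to(device)
    ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    # Both inputs and labels are created on the target device explicitly,
    # instead of relying on DDP to move CPU inputs for us.
    inputs = torch.randn(20, 10, device=device)
    labels = torch.randn(20, 5, device=device)

    optimizer.zero_grad()
    loss = loss_fn(ddp_model(inputs), labels)
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()
    return loss.item()


if __name__ == "__main__":
    # Single-process run; with 2 GPUs this function would be launched once
    # per rank, for example via torch.multiprocessing.spawn.
    print(train_step(rank=0, world_size=1))
```

Placing the inputs explicitly keeps the code working the same way whether or not device_ids is passed to DDP, and makes the device of every tensor visible at the call site.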

Thanks for the quick reply. My follow-up question: what is the length of self.device_ids when I use 2 GPUs?

if len(self.device_ids) == 1:
    # single device per process: move inputs to that device and run the module
    inputs, kwargs = self.to_kwargs(inputs, kwargs, self.device_ids[0])
    output = self.module(*inputs[0], **kwargs[0])
else:
    # multiple devices per process: scatter inputs, run replicas, gather outputs
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
    outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
    output = self.gather(outputs, self.output_device)

@Frazer You do not need to change the device_ids argument passed to DDP in the distributed tutorial even if you use 2 GPUs. The demo_basic function is run on each spawned process, so each process ends up with a replica of ToyModel wrapped by DDP.

@osalpekar Yeah, I know, and I don’t want to change device_ids. I just want to understand the flow of the program. I guess you mean that len(self.device_ids) == 1 in each process even with 2 GPUs. So when will the else branch be triggered?

Ah I see, thanks for the clarification! The else branch is triggered when you pass a list of multiple device IDs as the device_ids arg to the DDP constructor. This lets you specify which CUDA devices the model replicas will be placed on (see the DDP docs).

+1 to @osalpekar’s comment.

One thing I want to add is that, when DDP was initially introduced, it had two modes:

  • single-process multi-device (SPMD): each process exclusively works on multiple GPUs, and hence there will be multiple model replicas within the same process. In this case, the device_ids should be a list of GPUs one process should use.
  • single-process single-device (SPSD): each process exclusively works on one GPU, i.e., each process works on a single model replica. In this case, device_ids should only contain a single device.
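
To make the two modes concrete, here is a small sketch of which device_ids list each process would pass to DDP, and which branch of the forward code quoted earlier in the thread that list selects. Both helpers, device_ids_for_process and forward_branch, are hypothetical and exist only for illustration; the contiguous assignment of GPUs to ranks is likewise an assumption.

```python
from typing import List


def device_ids_for_process(rank: int, gpus_per_process: int) -> List[int]:
    """Return the CUDA device indices owned by the process with this rank.

    Assumes GPUs are handed out to ranks in contiguous blocks.
    """
    if gpus_per_process < 1:
        raise ValueError("gpus_per_process must be at least 1")
    first = rank * gpus_per_process
    return list(range(first, first + gpus_per_process))


def forward_branch(device_ids: List[int]) -> str:
    """Name the branch of the quoted forward code that this list selects."""
    if len(device_ids) == 1:
        return "to_kwargs"
    return "scatter/parallel_apply/gather"


if __name__ == "__main__":
    # SPSD on 2 GPUs: two processes, each with a single-element device_ids.
    for rank in range(2):
        ids = device_ids_for_process(rank, gpus_per_process=1)
        print("SPSD", rank, ids, forward_branch(ids))

    # SPMD on 2 GPUs: one process that owns both devices.
    ids = device_ids_for_process(0, gpus_per_process=2)
    print("SPMD", 0, ids, forward_branch(ids))
```

With 2 GPUs in SPSD mode, rank 0 gets [0] and rank 1 gets [1], so len(self.device_ids) is 1 in every process and only the first branch runs.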

As SPSD is almost always the recommended way to use DDP due to perf reasons, we are planning to retire SPMD mode soon. If there are concerns, please comment here: https://github.com/pytorch/pytorch/issues/47012