Using pin_memory with DistributedDataParallel

To fully reap the benefits of using pin_memory=True in the DataLoader, it is advised to make the CPU-to-GPU transfers non_blocking=True (as advised here). However, when using DistributedDataParallel, I am no longer the one manually calling .to(device) - the model’s forward is changed to move the inputs to the same device.

When I use them both, does DDP automatically detect that I am using pin_memory=True and switch to non_blocking=True? Do I need to modify anything else myself?
Also, should using pin_memory without non_blocking=True give me an improvement on its own, or only when I combine both?
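For reference, the pattern in question - pinned host memory plus an explicit asynchronous transfer in the training loop - might look like this sketch (the dataset, shapes, and batch size are made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset; pin_memory=True makes the DataLoader collate batches
# into page-locked (pinned) host memory.
dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))
loader = DataLoader(dataset, batch_size=16, pin_memory=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for inputs, targets in loader:
    # non_blocking=True lets the host-to-device copy run asynchronously;
    # it only has an effect when the source tensor is pinned.
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    # ... forward / loss / backward would go here ...
```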


That’s not the common approach and you can still call the to() operation on the data outside the model.

Is it not the default when I supply device_ids? I see a call to to_kwargs from _pre_forward, which is called by forward (in here). How do I disable/enable it?

You can refer to e.g. the ImageNet example or most other examples, which do not change the model architecture or the forward pass and just wrap the model in the DDP wrapper.
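A minimal runnable sketch of that approach (single process, gloo backend, toy Linear model - just to show that the model code itself is untouched and only wrapped):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process stand-in for the usual multi-process launch;
# world_size=1 with the gloo backend keeps it runnable on CPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 2)  # toy stand-in for a real architecture
ddp_model = DDP(model)  # on CUDA you would pass device_ids=[gpu]

x = torch.randn(4, 8)          # you move the data yourself in the loop
out = ddp_model(x)             # forward is unchanged; DDP syncs grads

dist.destroy_process_group()
```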

I am still confused. In the example code you shared, the model is still wrapped with DDP in the line:

model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])

As far as I understand, this means that when we call forward, we go to:

    def forward(self, *inputs, **kwargs):
        with torch.autograd.profiler.record_function("DistributedDataParallel.forward"):
            inputs, kwargs = self._pre_forward(*inputs, **kwargs)
            output = (
                self.module.forward(*inputs, **kwargs)
                if self._delay_all_reduce_all_params
                else self._run_ddp_forward(*inputs, **kwargs)
            )
            return self._post_forward(output)

i.e. before our forward is called (self.module.forward), _pre_forward will be called, moving the inputs to the device.
What am I missing?

If you are sticking to the standard approach of moving the data inside the DataLoader loop to the rank, why should the data be moved again inside the forward method? By then it’s already on the desired rank, isn’t it?


So if I call .to(device, non_blocking=True) from my own training loop, and then DDP calls .to(device) (without non_blocking), does the second call make my code wait for the device?
I thought (though I may be wrong here) that the gain from non_blocking=True is meaningful when there is work in my code between moving the data to the device and starting the computations on that data, e.g. inside the forward call. But maybe the assumption here is that any such work happens before the call to forward, while in my case it may be inside my forward - so I want to make sure that I do not miss the optimization if I use DDP.

No, since your data is already on the device and thus calling to(device) is a no-op.
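This is easy to verify: Tensor.to returns the tensor itself when no device or dtype change is needed, so the second call neither copies nor synchronizes:

```python
import torch

t = torch.randn(3)
# .to() with the tensor's own device (and unchanged dtype) is a no-op:
# it returns the very same tensor object, no copy is made.
same = t.to(t.device)
```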

The earlier you move the data to the device asynchronously w.r.t. the host, the more work the host can perform and overlap before the first device operation.
I would assume moving the data inside the forward with non_blocking=True might not help at all, as most likely the next call is a CUDA operation depending on that data. Or which operations do you want to overlap in the forward?
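To illustrate the point: non_blocking=True pays off when the host issues the copy early and then has real work to do before the first kernel consumes the data. A sketch (pinning is skipped on CPU-only builds, where the copy is synchronous anyway):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch = torch.randn(512, 512)
if device.type == "cuda":
    batch = batch.pin_memory()  # async copies need a pinned source

# Issue the copy early: with non_blocking=True the host returns
# immediately instead of waiting for the transfer to finish.
x = batch.to(device, non_blocking=True)

# Host-side work here can overlap with the in-flight copy ...
checksum = sum(range(100_000))

# ... but the first CUDA op consuming x is queued on the same stream
# as the copy, so it implicitly waits for the copy to complete.
y = x @ x
```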