To fully reap the benefits of using pin_memory=True in the DataLoader, it is advised to make the CPU-to-GPU transfers non_blocking=True (as advised here). However, when using DistributedDataParallel, I am no longer the one manually calling .to(device) - the model's forward is changed to move the inputs to the same device. When I use both together, does DDP automatically know that I am using pin_memory=True and switch to non_blocking=True? Do I need to modify anything else myself?
Also, should using non_blocking=True alone give me an improvement, or only when I combine both?
That’s not the common approach; you can still call the to() operation on the data yourself, outside the model.
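The common approach being referred to can be sketched as below. The names (device, model, loader) and shapes are illustrative, and the DDP wrapping is shown commented out since it needs a initialized process group; the point is that the data is moved in the training loop, not inside forward:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative setup; on a CPU-only machine pin_memory is ignored with a warning.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
# pin_memory=True keeps batches in page-locked host memory so the
# host-to-device copy can run asynchronously.
loader = DataLoader(dataset, batch_size=16, pin_memory=True)

model = torch.nn.Linear(10, 1).to(device)
# In a real multi-process job the model would be wrapped here, e.g.:
# model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])

for inputs, targets in loader:
    # Move the data yourself, outside the model's forward:
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    output = model(inputs)
    loss = torch.nn.functional.mse_loss(output, targets)
```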
Is it not the default behavior when I supply device_ids? I see a call to _pre_forward made by forward (in here). How do I disable/enable it?
You can refer to e.g. the ImageNet example, or most other examples, which do not change the model architecture or the forward pass and just wrap the model in DDP.
I am still confused. In the example code you shared, the model is still wrapped with DDP in this line:

```python
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
```

As far as I understand, this means that when we call forward, we go to:

```python
def forward(self, *inputs, **kwargs):
    inputs, kwargs = self._pre_forward(*inputs, **kwargs)
    output = (
        self.module.forward(*inputs, **kwargs)
        if self._delay_all_reduce_all_params
        else self._run_ddp_forward(*inputs, **kwargs)
    )
```

i.e. before our forward is called, _pre_forward will be called, moving the inputs to the device.
What am I missing?
If you are sticking to the standard approach of moving the data to the rank inside the DataLoader loop, why should the data be moved again inside the forward method? By then it's already on the desired rank, isn't it?
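This point can be checked directly: once a tensor is already on the target device, a second .to() call on it returns the very same tensor. Below is a toy module (not DDP; the class and names are made up for illustration) whose forward re-moves its input, mimicking a device-moving forward:

```python
import torch

class Wrapper(torch.nn.Module):
    """Toy module whose forward re-moves its input, mimicking a
    device-moving forward like DDP's _pre_forward."""
    def __init__(self, device):
        super().__init__()
        self.device = device
        self.linear = torch.nn.Linear(4, 2).to(device)

    def forward(self, x):
        x = x.to(self.device)  # redundant if x is already on self.device
        return self.linear(x)

device = torch.device("cpu")  # stands in for the rank's device
model = Wrapper(device)

x = torch.randn(3, 4).to(device)  # moved in the training loop, as usual
assert x.to(device) is x          # the second .to() returns the same tensor
out = model(x)
```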
So if I call .to(device, non_blocking=True) from my own training loop, and then DDP calls .to(device) (without non_blocking), does the second call not make my code wait for the device?
I thought (but I may be wrong here) that the gain from non_blocking=True is meaningful when there is overhead in my code between moving the data to the device and starting the calculations on that data, e.g. inside the forward call. But maybe the assumption here is that if any such overhead exists, it's before the call to forward; in my case it may be inside my forward, so I want to make sure that I do not miss the optimization if I use DDP.
No, since your data is already on the device, calling to(device) is a no-op.
The earlier you move the data to the device asynchronously w.r.t. the host, the more work the host can perform and overlap before the first device operation.
I would assume moving the data inside the forward with non_blocking=True might not help at all, as most likely the next call is a CUDA operation depending on that data. Or which operations do you want to overlap with the transfer inside the forward?
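The overlap being described can be sketched as follows. The variable names and the host-side work are illustrative; the point is that the async copy is issued as early as possible, so independent host work runs while the transfer is in flight, and the first device op that consumes the tensor is stream-ordered after the copy (on a CPU-only machine non_blocking is simply ignored):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Pinned host memory enables a truly asynchronous copy (CUDA only).
batch = (torch.randn(256, 128).pin_memory()
         if torch.cuda.is_available() else torch.randn(256, 128))

# Issue the async copy as early as possible...
batch_dev = batch.to(device, non_blocking=True)

# ...so independent host-side work (logging, metric bookkeeping,
# preparing the next step, etc.) can overlap with the transfer.
step_info = {"batch_size": batch.shape[0]}

# The first device op consuming batch_dev is stream-ordered after the
# copy, so no manual synchronization is needed here.
result = batch_dev.sum()
```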