Hi,
To fully reap the benefits of using pin_memory=True in the DataLoader, it is advised to also make the CPU-to-GPU transfers non-blocking by passing non_blocking=True (as advised here). However, when using DistributedDataParallel, I am no longer the one manually calling .to(device) - the model's forward is changed to move the inputs to the same device.
When I use them both, does DDP automatically know I am using pin_memory=True and switch to non_blocking=True? Do I need to modify anything else myself?
Also, should using pin_memory without non_blocking=True also give me an improvement, or does it only help when I combine both?
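For context, this is roughly my current setup (a minimal sketch; the Linear model and TensorDataset are just placeholders for my real model and data, and it assumes a CUDA device is available):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda")

dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))
loader = DataLoader(dataset, batch_size=32, pin_memory=True)  # pinned host memory

model = torch.nn.Linear(16, 2).to(device)

for inputs, targets in loader:
    # the transfers I am asking about - is non_blocking=True still useful here?
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    outputs = model(inputs)
```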
You can refer to e.g. the ImageNet example or most other examples, which do not change the model architecture or the forward pass and just wrap the model in the DDP wrapper.
If you are sticking to the standard approach of moving the data to the rank inside the DataLoader loop, why should the data be moved again inside the forward method? By then it's already on the desired rank, isn't it? See the sketch below.
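Something like this minimal sketch (assuming the script is launched with torchrun so the default env:// process group setup works; the Linear model and TensorDataset are placeholders):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = torch.nn.Linear(16, 2).to(rank)
ddp_model = DDP(model, device_ids=[rank])  # the forward pass itself is unchanged

dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))
loader = DataLoader(dataset, batch_size=32, pin_memory=True,
                    sampler=DistributedSampler(dataset))

for inputs, targets in loader:
    # standard approach: move the data once, inside the training loop
    inputs = inputs.to(rank, non_blocking=True)
    targets = targets.to(rank, non_blocking=True)
    outputs = ddp_model(inputs)
```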
So if I call .to(device, non_blocking=True) from my own training loop, and then DDP calls .to(device) (without non_blocking), won't the second call make my code wait for the device?
I thought (but I may be wrong here) that the gain from non_blocking=True is meaningful when there is work in my code between moving the data to the device and starting the computation on that data, e.g. inside the forward call. But maybe the assumption here is that any such overhead happens before the call to forward; in my case it may be inside my forward, so I want to make sure that I do not miss the optimization if I use DDP.
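For example, something like this (a toy sketch; the sleep is just a stand-in for real host-side work, and the model is a placeholder):

```python
import time
import torch

device = torch.device("cuda")
model = torch.nn.Linear(16, 2).to(device)
inputs = torch.randn(32, 16, pin_memory=True)  # pinned host tensor

inputs = inputs.to(device, non_blocking=True)  # async copy is issued here
time.sleep(0.001)  # host-side "overhead" that can overlap with the copy
outputs = model(inputs)  # first CUDA op that actually needs the data
```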
No, since your data is already on the device, calling .to(device) is a no-op.
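You can verify the no-op behavior directly:

```python
import torch

x = torch.randn(4, device="cuda")
y = x.to("cuda")  # tensor is already on the device
print(y is x)     # True: .to() returns the same tensor, no copy happens
y = x.to("cuda", non_blocking=True)
print(y is x)     # still True
```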
The earlier you move the data to the device asynchronously w.r.t. the host, the more work the host can perform and overlap before the first device operation.
I would assume moving the data inside the forward with non_blocking=True might not help at all, as most likely the next call is a CUDA operation depending on that data. Or which operations do you want to overlap inside the forward?
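To make the contrast concrete, a rough sketch (placeholder model and data, assuming a CUDA device):

```python
import torch

device = torch.device("cuda")
model = torch.nn.Linear(16, 2).to(device)
batch = torch.randn(32, 16, pin_memory=True)

# Early transfer: the copy is issued here, and the host keeps running,
# so any work between this line and the forward overlaps with the copy.
batch_gpu = batch.to(device, non_blocking=True)
# ... host-side work here overlaps with the in-flight copy ...
out = model(batch_gpu)

# Late transfer inside the forward: the very next op consumes the data,
# so there is nothing left to overlap with and non_blocking buys nothing.
def forward_with_transfer(x):
    x = x.to(device, non_blocking=True)
    return model(x)  # immediately depends on the copy

out2 = forward_with_transfer(batch)
```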