One GPU per thread vs. multiple GPUs per CPU thread

The PyTorch docs make the following statement:

The multi-GPU functions (which stand for multiple GPUs per CPU thread) are deprecated. As of today, PyTorch Distributed’s preferred programming model is one device per thread, as exemplified by the APIs in this document.

(see here: Distributed communication package - torch.distributed — PyTorch 2.3 documentation)

My question is: why is that? What are the benefits of the one-device-per-thread programming model, and what are the tradeoffs?

Python’s GIL would block the threads, slowing down your code, which is why we recommend using a single process per device.
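For reference, the recommended one-process-per-device pattern looks roughly like this. This is a minimal sketch rather than code from the thread: the toy model, port, and tensor shapes are placeholders.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # One process per GPU: each worker has its own Python interpreter,
    # so the GIL never serializes the per-device kernel launches.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"  # placeholder port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(10, 10).to(rank)  # toy model for illustration
    ddp_model = DDP(model, device_ids=[rank])
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    out = ddp_model(torch.randn(32, 10, device=rank))
    out.sum().backward()  # gradients are all-reduced across processes
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

Each worker drives exactly one GPU from its own process, so kernel launches for different devices happen in parallel instead of being funneled through a single GIL-bound thread.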

Thanks for the reply @ptrblck. However, the docs talk about threads, not processes. So the GIL would be a problem in the case where we have one GPU per thread, not one thread accessing multiple GPUs. What am I missing? Unless it is one single-threaded process per GPU, hence one thread per GPU.

If a single thread is responsible for driving the work of multiple devices, you would be back at nn.DataParallel with its shortcomings of serialized launches etc. A lot of models in DDP already suffer from CPU overheads (the CPU is not fast enough to launch the work on the GPU) and thus benefit from using CUDA Graphs. If you put more pressure on the CPU by forcing it to launch the work of all GPUs, the overhead would become even worse and the launches would also be sequential (unless I’m missing something now).
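To make the CUDA Graphs point concrete, here is a minimal capture-and-replay sketch (the toy model and shapes are illustrative, not from this thread). A whole sequence of kernels is recorded once and then replayed with a single CPU-side launch, which is how CUDA Graphs amortize launch overhead:

```python
import torch

device = torch.device("cuda")
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1024)
).to(device)
static_input = torch.randn(64, 1024, device=device)

# Warm up on a side stream, as required before graph capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture: the kernels are recorded into the graph, not executed eagerly.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_output = model(static_input)

# Replay: one CPU-side call launches the entire recorded kernel sequence.
static_input.copy_(torch.randn(64, 1024, device=device))
g.replay()
print(static_output.shape)
```

The replay path removes the per-kernel Python and driver launch cost, which is exactly the CPU overhead that gets worse if a single thread must also launch work for every other GPU.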

If you put more pressure on the CPU by forcing it to launch the work of all GPUs, the overhead would become even worse and the launches would also be sequential (unless I’m missing something now).

This is a different problem from the one posed by the Python GIL though, right?