The multi-GPU functions (which stand for multiple GPUs per CPU thread) are deprecated. As of today, PyTorch Distributed’s preferred programming model is one device per thread, as exemplified by the APIs in this document.
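For context, a minimal sketch of the one-process-per-device pattern those docs describe (placeholder toy model and sizes; assumes a `torchrun`-style launcher that sets `LOCAL_RANK` and the rendezvous env vars):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU: each process binds to exactly one device.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model just to illustrate the wrapping.
    model = nn.Linear(128, 128).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
    x = torch.randn(32, 128, device=local_rank)
    loss = ddp_model(x).sum()
    loss.backward()   # gradients are all-reduced across processes
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch with e.g. `torchrun --nproc_per_node=NUM_GPUS this_script.py`
```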
Thanks for the reply @ptrblck. However, the docs talk about threads, not processes. So the GIL would be a problem in the case where we have one GPU per thread, not one thread accessing multiple GPUs. What am I missing? Unless it is one single-threaded process per GPU, hence one thread per GPU.
If a single thread is responsible for driving the work of multiple devices, you would be back at nn.DataParallel with its shortcomings of serialized launches, etc. A lot of models in DDP already suffer from CPU overhead (the CPU is not fast enough to launch the work on the GPU) and thus benefit from using CUDA Graphs. If you put more pressure on the CPU by forcing it to launch the work of all GPUs, the overhead would become even worse and the launches would also be sequential (unless I’m missing something now).
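To illustrate the CUDA Graphs point, here is a rough sketch of capture and replay (placeholder model and shapes, based on the `torch.cuda.CUDAGraph` / `torch.cuda.graph` API): the whole captured sequence is relaunched with a single CPU-side call, which is why it helps when the CPU cannot keep up with kernel launches.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
static_input = torch.randn(64, 1024, device="cuda")

# Warm-up on a side stream before capture, as required by the capture rules.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        _ = model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture the forward pass once into a graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Replay: copy new data into the static input, then relaunch the whole graph
# with one CPU-side call instead of one launch per kernel.
for _ in range(10):
    static_input.copy_(torch.randn(64, 1024, device="cuda"))
    g.replay()
```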
> If you put more pressure on the CPU by forcing it to launch the work of all GPUs, the overhead would become even worse and the launches would also be sequential (unless I’m missing something now).
This is a different problem from the one posed by the Python GIL though, right?