Using DistributedDataParallel on a shared GPU


I’m a research assistant and I want to replicate an experiment that used DistributedDataParallel. I’ve also heard good things about it, so I’d like to try it myself. However, I noticed that the documentation says the program must have exclusive access to the GPU when using the NCCL backend (the project’s source code also used barrier(), which as far as I can tell is only supported by NCCL without extra steps like compiling). Does this exclusivity also cover other processes on the GPU that aren’t distributed? For example, can I run a small model (using non-distributed DataParallel or plain single-GPU training) on the same GPU? I want to make sure before I try, because the GPU is shared and I’m afraid I might crash other people’s programs. I’ve looked online, but I can only find reports of problems when people try to run two distributed instances on one GPU.

Yes, you could run into deadlocks or hard-to-debug hangs if your program does not have exclusive access to the GPU when using the NCCL backend. This is because synchronization done by the application can involve waiting for all operations in a stream to complete, which becomes unpredictable if it ends up waiting on ops from other applications.
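For reference, here is a minimal sketch of the kind of setup that triggers this behavior: initializing a process group and calling `barrier()`, which blocks until every rank reaches it. The backend, address, and port values below are illustrative assumptions; the example uses the `gloo` backend so it can run on CPU, but with `"nccl"` that `barrier()` is where a hang would surface if the GPU is being shared.

```python
import os
import torch.distributed as dist

# Rendezvous settings (assumed values for a single-machine run).
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"

# Use "nccl" for GPU training; "gloo" here so the sketch runs on CPU.
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# barrier() blocks until all ranks arrive. With NCCL, this is the
# kind of collective that can hang if other processes occupy the GPU.
dist.barrier()

dist.destroy_process_group()
```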

But from what you’re saying, as long as I’m the only one using the NCCL backend, the other programs shouldn’t be blocked, right? They wouldn’t be waiting on anything themselves, and once they finish, my program should be able to continue. Is that how it works?