Using DistributedDataParallel on a shared GPU


I’m a research assistant and I want to replicate an experiment that used DistributedDataParallel. I’ve also heard good things about it, so I’d like to try it myself. However, I noticed that the documentation says the program must have exclusive access to the GPU when using the NCCL backend (the project’s source code also used barrier(), which as far as I can tell is only supported by NCCL without extra steps like compiling). Does this exclusivity also cover other processes on the GPU that aren’t distributed? For example, can I run a small model (using non-distributed DataParallel or plain single-GPU training) on the same GPU? I want to make sure before I try, because the GPU is shared and I’m afraid I might crash other people’s programs. I’ve looked online, but I can only find reports of problems when people try to run two distributed instances on one GPU.

Yes, you could run into deadlocks or hard-to-debug hangs if your program does not have exclusive access to the GPU when using the NCCL backend. This is because synchronization done by the application can involve waiting for all operations in a stream to complete, which becomes unpredictable if it ends up waiting on ops from other applications.
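For reference, here is a minimal sketch of the kind of setup that triggers this behavior: initializing a process group and calling `barrier()`, which blocks until every rank reaches it. The backend, address, and port values below are illustrative assumptions; the example uses the `gloo` backend so it can run on CPU, but with `"nccl"` that `barrier()` is where a hang would surface if the GPU is being shared.

```python
import os
import torch.distributed as dist

# Rendezvous settings (assumed values for a single-machine run).
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"

# Use "nccl" for GPU training; "gloo" here so the sketch runs on CPU.
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# barrier() blocks until all ranks arrive. With NCCL, this is the
# kind of collective that can hang if other processes occupy the GPU.
dist.barrier()

dist.destroy_process_group()
```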

But from what you’re saying, as long as I’m the only one using the NCCL backend, the other programs shouldn’t be blocked, right? They wouldn’t be waiting on anything themselves, and once they finish, my program should be able to continue. Is that how it works?