I’m writing code for distributed training on multiple GPUs. However, my local devbox has only 1 GPU. Is it possible to emulate a multi-GPU setup on a 1-GPU devbox to test the code (esp. the parts that have collective communications) locally?
Hey @justinliu, you can use gloo, which is a CPU backend (Distributed communication package - torch.distributed — PyTorch master documentation) that supports many of the same collectives as NCCL. Does that work?
There is currently no way to virtualize multiple GPUs from a single GPU in PyTorch. Some NVIDIA GPUs support MIG (NVIDIA Multi-Instance GPU), but the collective communications in NCCL do not support it.
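To make this concrete, here is a minimal sketch of the gloo approach: it spawns several CPU processes on one machine, each joining the same process group, and runs an `all_reduce` exactly as multi-GPU code would. The master address/port values and the world size of 4 are arbitrary choices for local testing, not anything required by PyTorch.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Each spawned process joins the same process group over the gloo (CPU) backend.
    # MASTER_ADDR/MASTER_PORT are arbitrary local rendezvous settings for this demo.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Every rank contributes a tensor holding its own rank id.
    t = torch.tensor([float(rank)])
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    # After all_reduce, every rank holds 0 + 1 + ... + (world_size - 1).
    assert t.item() == sum(range(world_size))
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4  # emulate 4 "devices" as 4 CPU processes
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
    print("all_reduce OK across", world_size, "ranks")
```

Since each rank is just a CPU process, the collective-communication logic (rank bookkeeping, reduce ops, synchronization points) can be tested locally; you would swap the backend back to `nccl` and move tensors to `cuda` devices when running on the real multi-GPU machine.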
Got it. This makes sense. Thank you!