I’m writing code for distributed training on multiple GPUs. However, my local devbox has only 1 GPU. Is it possible to emulate a multi-GPU setup on a 1-GPU devbox to test the code (esp. the parts that have collective communications) locally?
Hey @justinliu, you can use gloo, which is a CPU backend (Distributed communication package - torch.distributed — PyTorch master documentation) that supports many of the same collectives as NCCL. Does that work?
There is currently no way to virtualize multiple GPUs from a single GPU in PyTorch. Some NVIDIA GPUs support MIG (NVIDIA Multi-Instance GPU), but the collective communications in NCCL do not support it.
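To make this concrete, here is a minimal sketch of the gloo approach: it spawns several CPU processes on one machine, each joining the same process group, and runs an `all_reduce` exactly as multi-GPU code would. The master address/port values and the world size of 4 are arbitrary choices for local testing, not anything required by PyTorch.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Each spawned process joins the same process group over the gloo (CPU) backend.
    # MASTER_ADDR/MASTER_PORT are arbitrary local rendezvous settings for this demo.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Every rank contributes a tensor holding its own rank id.
    t = torch.tensor([float(rank)])
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    # After all_reduce, every rank holds 0 + 1 + ... + (world_size - 1).
    assert t.item() == sum(range(world_size))
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4  # emulate 4 "devices" as 4 CPU processes
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
    print("all_reduce OK across", world_size, "ranks")
```

Since each rank is just a CPU process, the collective-communication logic (rank bookkeeping, reduce ops, synchronization points) can be tested locally; you would swap the backend back to `nccl` and move tensors to `cuda` devices when running on the real multi-GPU machine.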
Got it. This makes sense. Thank you!