Hi @ptrblck, we generally use NVLink for parallelism and the Horovod framework for distributed training, where a common task is executed on multiple processors. Is there any connection between these two? What I mean is, can the Horovod framework be used together with NVLink? Please help me if you know anything about this.
Thank you
I'm not familiar with Horovod's implementation. But if it internally uses PyTorch ProcessGroup or DistributedDataParallel, it would work with NVLink, provided you specify the nccl backend when calling init_process_group.
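As a rough illustration of the point above, here is a minimal sketch of a DistributedDataParallel step that requests the nccl backend in init_process_group. It is not Horovod code. The function names choose_backend and run_one_step, the toy Linear model, and the default address and port are assumptions made up for this example. The sketch falls back to gloo when no CUDA GPU is present so it can still run on a CPU-only machine. Whether traffic actually goes over NVLink is decided by NCCL based on the hardware topology, not by anything in this script.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def choose_backend() -> str:
    # NCCL requires CUDA GPUs; fall back to gloo so the sketch also runs on CPU.
    if torch.cuda.is_available() and dist.is_nccl_available():
        return "nccl"
    return "gloo"


def run_one_step() -> float:
    # A launcher such as torchrun normally sets these variables; the defaults
    # below (assumed values) allow a plain single-process run.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    backend = choose_backend()
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
    try:
        if backend == "nccl":
            # One GPU per process; NCCL handles the GPU-to-GPU communication.
            torch.cuda.set_device(local_rank)
            device = torch.device("cuda", local_rank)
            model = DDP(nn.Linear(10, 1).to(device), device_ids=[local_rank])
        else:
            device = torch.device("cpu")
            model = DDP(nn.Linear(10, 1))

        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        inputs = torch.randn(8, 10, device=device)
        targets = torch.randn(8, 1, device=device)

        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(inputs), targets)
        loss.backward()  # DDP all-reduces gradients across ranks during backward
        optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(f"loss: {run_one_step():.4f}")
```

On a multi-GPU machine this would typically be started with one process per GPU, for example through torchrun with --nproc_per_node set to the number of GPUs, assuming the script is saved to a file.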
Thank you so much for your reply, @mrshenli. Could you please share some code-related links with me, if you have any?
Thank you
Sure. Here are two tutorials with code, c10d and ddp, and a minimum ddp example.
Thank you so much for your reply, @mrshenli. I will work on it.