Multi-node multi-GPU training, partly NVLink

I want to run some multi-node multi-GPU training where some GPUs are connected via NVLink, but probably not all of them (I don't really know the exact topology in advance).

How would I ideally do that with PyTorch?

For the reduce, I would ideally want it done in the most efficient way possible, i.e. first reduce over the NVLink-connected subsets as far as possible, then over the network, and then broadcast again over NVLink, or something like that.
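
Something like this sketch with explicit subgroups is what I have in mind. This is purely illustrative, assuming one process per GPU, torchrun-style LOCAL_RANK / LOCAL_WORLD_SIZE env vars, and node-contiguous rank assignment; the helper names are made up:

import os
import torch.distributed as dist

def make_groups():
    world_size = dist.get_world_size()
    local_world_size = int(os.environ["LOCAL_WORLD_SIZE"])
    num_nodes = world_size // local_world_size
    # Every rank must create every subgroup, in the same order.
    intra_groups = [
        dist.new_group(list(range(n * local_world_size, (n + 1) * local_world_size)))
        for n in range(num_nodes)
    ]
    inter_group = dist.new_group([n * local_world_size for n in range(num_nodes)])
    return intra_groups[dist.get_rank() // local_world_size], inter_group

def hierarchical_all_reduce(tensor, intra_group, inter_group):
    local_world_size = int(os.environ["LOCAL_WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])
    # Global rank of the local leader (local rank 0 on this node).
    leader = (dist.get_rank() // local_world_size) * local_world_size
    # 1) Sum-reduce within the node (ideally over NVLink) onto the local leader.
    dist.reduce(tensor, dst=leader, group=intra_group)
    # 2) All-reduce across the node leaders over the network.
    if local_rank == 0:
        dist.all_reduce(tensor, group=inter_group)
    # 3) Broadcast the result back to all GPUs in the node (again over NVLink).
    dist.broadcast(tensor, src=leader, group=intra_group)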

NCCL alone probably does not work for this? So Gloo then? Or both together, doing it all in a more custom way?

Averaging the gradients every step is probably also not the most efficient, so I was thinking of averaging the model parameters instead. (Just like this.)
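
I.e. roughly like this, as a rough sketch only (the function name and the call frequency are my own choices; I think PyTorch also has helpers for this under torch.distributed.algorithms.model_averaging):

import torch
import torch.distributed as dist

@torch.no_grad()
def average_parameters(model: torch.nn.Module):
    # All-reduce each parameter and divide by the world size to get the average.
    world_size = dist.get_world_size()
    for param in model.parameters():
        dist.all_reduce(param.data, op=dist.ReduceOp.SUM)
        param.data /= world_size

This would then be called e.g. every N steps (or at epoch boundaries) instead of all-reducing the gradients on every single step.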

NCCL should detect all NVLinks and use them if possible. If needed, you might want to remap the devices via CUDA_VISIBLE_DEVICES to make sure your heavy communication ops go over NVLink.
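
For example, if nvidia-smi topo -m shows that the NVLink-connected GPUs are not adjacent in the default enumeration, you could reorder them before CUDA is initialized. The concrete mapping below is just an illustration and depends on your machine:

import os

# Illustrative only: reorder the visible devices so that e.g. cuda:0/cuda:1 and
# cuda:2/cuda:3 end up being the NVLink-connected pairs on this (hypothetical) box.
# Must be set before any CUDA context is created.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2,1,3"

import torch  # safe to import afterwards; CUDA is initialized lazily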

There are sometimes cases where a CPU ↔ Ethernet ↔ CPU reduction is still faster. E.g. we do epoch-based training, and the data loaders on the workers do not necessarily produce the same number of batches. So, at the beginning of every step, we check whether every worker still has data (otherwise we finish the epoch). This is the code:

_has_data = torch.tensor([extern_data_raw is not None], dtype=torch.int8)
if self._torch_distributed_ctx:
    # Use all-reduce with MIN to check whether all workers still have data;
    # if at least one worker has no data, all workers finish this epoch.
    torch.distributed.all_reduce(_has_data, op=torch.distributed.ReduceOp.MIN)
if not _has_data[0]:
    break

Originally we used NCCL here, moving the tensor to the GPU and then back to the CPU. That was much slower than keeping this code on the CPU and using dist.init_process_group(backend=None) to initialize both NCCL and Gloo.
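
I.e. roughly like this (simplified sketch, assuming torchrun-style env vars; as far as I understand, with backend=None recent PyTorch versions create both the Gloo and the NCCL backend and dispatch collectives by tensor device):

import os
import torch
import torch.distributed as dist

# backend=None initializes both Gloo (used for CPU tensors) and NCCL (used for CUDA tensors).
dist.init_process_group(backend=None)
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Small control tensors (like _has_data above) stay on the CPU and are reduced
# via Gloo over the network, without any host/device copies.
has_data = torch.tensor([1], dtype=torch.int8)  # illustrative stand-in
dist.all_reduce(has_data, op=dist.ReduceOp.MIN)

# Tensors that already live on the GPU go through NCCL.
grads = torch.randn(1024, device="cuda")
dist.all_reduce(grads, op=dist.ReduceOp.SUM)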

This makes sense to me. Or do you think this is unexpected, and NCCL should always be just as fast? If yes, what could the problem be? If no, does that mean we should rather use Gloo and leave the tensors on the CPU for such cases?

I don’t fully understand why you are moving data between the CPU and the GPU in the first place, as that could easily be the bottleneck. In that case it might not be NCCL that underperforms, but your data movement.

In the example with _has_data, the tensor is on the CPU. And here, I guess Gloo would make more sense, right?

But for all data that is on the GPU anyway, I can (should) use NCCL, no matter whether NVLink is available or not?

I don’t know, as I’m not using Gloo and usually don’t want to hold data on the CPU.

Yes, NCCL should give you the best performance for data stored on the GPU.