Hi,
I am working in distributed mode (one process per GPU). I found in the documentation that I can sum the same tensor across different GPUs with torch.distributed.all_reduce(tensor), but the tensors being summed can only be CUDA tensors. Is there a convenient way to sum two numpy matrices from different processes?
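For reference, the workaround I have now looks roughly like the sketch below: wrap the numpy array in a tensor, all_reduce it, and convert back. (This sketch uses the gloo backend with a single process and a CPU tensor just so it runs standalone; in my real setup I have to call .cuda() on the tensor before the all_reduce, and the rank/world size come from the launcher.)

```python
import os

import numpy as np
import torch
import torch.distributed as dist

def allreduce_numpy(arr: np.ndarray) -> np.ndarray:
    """Sum a numpy array across all processes via a torch tensor round-trip."""
    t = torch.from_numpy(arr)          # in the real setup: .cuda() here
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    return t.numpy()                   # and .cpu().numpy() here

if __name__ == "__main__":
    # Single-process gloo group, only so the sketch is runnable on its own.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    arr = np.arange(6, dtype=np.float32).reshape(2, 3)
    result = allreduce_numpy(arr.copy())
    # With world_size == 1 the sum over processes is just the array itself.
    print(result)

    dist.destroy_process_group()
```

Is there a cleaner way than this tensor round-trip, or is converting to tensors the intended approach?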