How can I sync NumPy objects in distributed mode?


I am working in distributed mode (one process per GPU). The docs say I can sum the same tensor across GPUs with torch.distributed.all_reduce(tensor), but the tensors being reduced must be CUDA tensors. Is there a convenient way to sum NumPy matrices from different processes?
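For context, the workaround I'm currently considering is round-tripping through a torch tensor (a sketch only; it assumes the process group is already initialized, and picks the device based on the backend, since NCCL needs CUDA tensors while gloo reduces on CPU):

```python
import numpy as np
import torch
import torch.distributed as dist

def allreduce_numpy(arr: np.ndarray) -> np.ndarray:
    """Sum a NumPy array across all ranks by converting it to a
    torch tensor, reducing, and converting back.

    Assumes dist.init_process_group(...) has already been called.
    """
    # NCCL only reduces CUDA tensors; gloo can reduce CPU tensors.
    device = "cuda" if dist.get_backend() == "nccl" else "cpu"
    t = torch.from_numpy(arr).to(device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # in-place sum across ranks
    return t.cpu().numpy()
```

But this copies the data twice per call, so I'm wondering if there is a cleaner way.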