How to share a value across different gpu?

I am writing a custom module. I want to compute the batch statistics (like running mean in Batch Normalization) by the main GPU and then distributed to other GPUs. How to share a value across different GPUs?

If you want to manually send different payloads to the GPU each one you just had to do:

(tensorX or model).to(“cuda:0”)
(tensorX or model).to(“cuda:1”)
Then you manage each model manually on your code.

But if you prefer this information are done automatic, you just set your devide to “cuda” this will use all your GPUs and wrap your model on DataParallel and the job will be split on the GPUs

device = “cuda”
model = MyModel()
model = DataParallel(model)

1 Like

I run a model on multi-GPU. I define a custom convolutional layer. One of the parameters in the convolution is updated by some formula in the main GPU. After updating this parameter, I need to send the new value of this parameter to other GPU. But how to do this?

Hi …

about the data tensor, you also need to send to device, then if it’s a matrix the pytoch will manage to split the matrix in a half for each GPU.


But if you are trying to control each GPU individually then you need to create a sub model of each set of GPU and just send the information to the specific device

1 Like

You can use torch.distributed for this purpose(tutorial), over all steps should be like that:
I commented parts that are not directly related to sending data.

  1. a function that creates communication channel for distributed computing, simply help each GPU/CPU/Node to find each other, we will use a file that everyone can write, (file should not exists before this function runs)
def init_processes(rank, size, fn, backend="gloo"):
	""" Initialize the distributed environment. """
	print("started init at{}".format(rank))
	# we need those so process can talk to each other including over a network
	print("end init at{}".format(rank))
  1. Then you will create n process to handle n GPUs:
if __name__ == "__main__":
    #lets say you have two gpu
    size = 2
    processes = []
    for rank in range(size):
       # each process have a rank starting from 0 and they will be in charge of each gpu
       # each of them will call a function named run after they setup communication channel
        p = Process(target=init_processes, args=(rank, size, run))

    for p in processes:

  1. Now you will write a function that will handle rest of the stuff in each GPU, probably you already have code that covers training etc, you will wrap them by that function so each GPU does the training. When it comes to your custom model, it should also accept a parameter that indicates if it is running on the main GPU, then it will calculate and send data to other GPUs

def run(rank, size):
    #read data 
    # create the model model = Net()
    # choose an optimizer = optim.SGD(model.parameters(),
                         # lr=0.01, momentum=0.5)

    # start training 
   for epoch in range(10):
        epoch_loss = 0.0
       # for data, target in train_set:
            output = model(data,rank) # here your model will get rank of the gpu as an input
            #loss = F.nll_loss(output, target)
            #epoch_loss += loss.item()
        #print('Rank ', dist.get_rank(), ', epoch ',
              #epoch, ': ', epoch_loss / num_batches)
  1. Then your custom module will handle calculation of that value and sending to others,
class TwoLayerNet(torch.nn.Module):
    #def __init__(self, D_in, H, D_out):
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
       # super(TwoLayerNet, self).__init__()
        #self.linear1 = torch.nn.Linear(D_in, H)
       # self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x, rank):
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        if rank ==0:
            # send it to other GPUs, myvalue should be a tensor
            myvalue = torch.tensor(myvalue)
            myvalue = torch.zeros(1)
        # use myvalue here
        #h_relu = self.linear1(x).clamp(min=0)
        #y_pred = self.linear2(h_relu)
        #return y_pred

I find that BatchNorm has similar behavior. When using the DataParallel, BatchNorm update running mean and running variance at each replica. However, each replica free the running mean and running variance buffer at the end of the iteration. Moreover, weights and buffers of the replica on device[0] share storage with those of the input model! Therefore, BatchNorm only uses the batch statistics on device[0], and DataParallel will replicate the weight and buffer to other GPUs.