I came across a problem: when I use the torch.autograd.grad method in a multiprocessing job, the grad call takes much longer than I expected. Is this intentional?
I must admit that from your description it is a bit hard to tell what is going on. Depending on your setup, it could be that the gradient computation is already parallelized internally, which would reduce the effectiveness of multiprocessing for the backward pass.
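As a quick sanity check (an illustrative snippet, not from the original post), you can inspect how many intra-op threads PyTorch uses by default; if it is greater than one, each backward pass may already be running in parallel:

```python
import torch

# PyTorch defaults to roughly one intra-op thread per physical core,
# so autograd's backward pass may already be parallelized internally.
print(torch.get_num_threads())
```

If this prints a number close to your core count, spawning several worker processes on top of it will oversubscribe the CPU.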
The pseudocode is roughly as follows:
```python
def average_gradients(model):
    size = dist.get_world_size()
    for param in model.parameters():
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad.data /= size

def train(rank, size, lock):
    for ep in range(1000):
        optimizer.zero_grad()
        a = torch.tensor(xxx)
        # manipulating a to derive tensor b
        c = torch.autograd.grad(b, a, create_graph=True)
        loss = nn.MSELoss()(b, b_target) + nn.MSELoss()(c, c_target)
        loss.backward(retain_graph=True)
        average_gradients(net)
        optimizer.step()

if __name__ == '__main__':
    size = 4
    processes = []
    for rank in range(size):
        p = Process(target=train, args=(rank, size, lock))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```
I tested with different values of size; the main time-consuming part is the torch.autograd.grad call. The smaller the value of size, the faster that line runs.
This looks quite ok (note that you don’t need the retain_graph in the backward call).
What is problematic for you? Can you give a concrete example of what is slower than you expect?
I found the problem. When you create a new subprocess, PyTorch by default sets the number of threads to the maximum potentially usable (depending on how many cores your node has). After calling torch.set_num_threads(1) (or 2) in each worker, the speed behaves as I expected.
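A minimal sketch of that fix, reusing the worker structure from the pseudocode above (train, rank, and size come from the original post; the body is abbreviated and illustrative):

```python
import torch
from torch.multiprocessing import Process

def train(rank, size):
    # Cap intra-op parallelism in each worker so that size processes
    # don't oversubscribe the node's cores during torch.autograd.grad.
    torch.set_num_threads(1)
    # ... the training loop from the pseudocode above ...

if __name__ == '__main__':
    size = 4
    processes = []
    for rank in range(size):
        p = Process(target=train, args=(rank, size))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```

The key point is that torch.set_num_threads must be called inside each subprocess (e.g. at the top of train), since the thread-pool setting is per process.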