How can I tell whether the backward pass of a model is multi-threaded?

I recently implemented a model with two branches. Although the computational graphs of the two branches have no connections, they both modify a shared GPU buffer during backward. The losses of the two branches are summed, so they backward together. The non-deterministic backward results made me realize that the backward pass over the computation graph is not single-threaded but multi-threaded.

Multi-threaded backward driven by the autograd graph is clever, but how can I tell whether the gradient computation of certain parts of a model runs in multiple threads? Are there clear guidelines or rules? Under what conditions will two disconnected parts of the graph run their backward in separate threads?

a = conv1(x)
b = conv2(x)
c = a + b
c.sum().backward()

Is it guaranteed that the gradients of a and b will be computed in separate threads?
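One way I tried to probe this: record which thread executes the backward of each branch by attaching `register_hook` to the branch outputs and logging `threading.get_ident()` inside the hook. This is just a diagnostic sketch, not an official API for inspecting the engine; the module shapes and the `seen` list below are illustrative, and the CPU example here may behave differently from the GPU case I described.

```python
import threading
import torch
import torch.nn as nn

# Two independent branches, as in the snippet above (illustrative shapes).
conv1 = nn.Conv2d(3, 4, 3)
conv2 = nn.Conv2d(3, 4, 3)
x = torch.randn(1, 3, 8, 8)

seen = []  # (branch name, thread id) pairs, filled during backward

a = conv1(x)
b = conv2(x)
# Hooks fire when the gradient w.r.t. a / b is computed; returning None
# leaves the gradient unchanged.
a.register_hook(lambda g: seen.append(("a", threading.get_ident())))
b.register_hook(lambda g: seen.append(("b", threading.get_ident())))

(a + b).sum().backward()
print(seen)
```

Comparing the two recorded thread ids (and whether they differ from the main thread's id) at least shows whether the two branches' gradients were computed on the same engine thread in a given run.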