How does autograd merge 'parallel paths'?

I’m trying to manually compute the gradients of the loss function at the top:

The loss is a combination of an L1 loss and an L_DSSIM loss, and both are functions of my model's prediction y(θ).
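For concreteness, something like a weighted sum (the exact form of the combination is not the point here, and the weight λ is just a placeholder):

$$
L(\theta) = (1 - \lambda)\, L_1\big(y(\theta)\big) + \lambda\, L_{\mathrm{DSSIM}}\big(y(\theta)\big)
$$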

Therefore I have two ways of applying the chain rule to calculate dL/dθ, as I have highlighted in blue, which I imagine as two parallel paths in the computational graph.

How do I combine them? Or better, how does PyTorch do it, so I can replicate it?

The total gradient is the sum of the gradients from each of the paths.
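You can check this in PyTorch directly. This is a minimal sketch with a toy prediction and a placeholder for the DSSIM term (not your actual model or loss), just to show that the gradient of the combined loss equals the sum of the gradients flowing through each parallel path:

```python
import torch
import torch.nn.functional as F

theta = torch.randn(5, requires_grad=True)
target = torch.randn(5)

def predict(t):
    return 2.0 * t + 1.0              # stand-in for y(theta)

lam = 0.2                             # assumed weighting between the two terms

# Gradient through each path separately
y = predict(theta)
path_l1 = (1 - lam) * F.l1_loss(y, target)
grad_path_l1 = torch.autograd.grad(path_l1, theta, retain_graph=True)[0]

path_dssim = lam * ((y - target) ** 2).mean()   # placeholder for L_DSSIM
grad_path_dssim = torch.autograd.grad(path_dssim, theta)[0]

# Gradient of the combined loss, as autograd computes it
y = predict(theta)
loss = (1 - lam) * F.l1_loss(y, target) + lam * ((y - target) ** 2).mean()
grad_total = torch.autograd.grad(loss, theta)[0]

print(torch.allclose(grad_total, grad_path_l1 + grad_path_dssim))  # True
```

Any weighting factors simply ride along with their own path; where the two branches meet, autograd just sums the incoming gradients.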


Why sum and not product? The chain rule involves products of derivatives.

Yes, connecting two paths sequentially gives a product, per the chain rule, but connecting two paths in parallel gives a sum.
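Written out with the multivariable chain rule, using the notation from your post:

$$
\frac{dL}{d\theta}
= \frac{\partial L}{\partial L_1}\,\frac{\partial L_1}{\partial y}\,\frac{\partial y}{\partial \theta}
\;+\;
\frac{\partial L}{\partial L_{\mathrm{DSSIM}}}\,\frac{\partial L_{\mathrm{DSSIM}}}{\partial y}\,\frac{\partial y}{\partial \theta}
$$

Within each path the factors multiply (sequential composition); the paths themselves are added (parallel composition). During backward, autograd realizes this by accumulating, i.e. summing, gradients whenever a tensor is consumed by more than one downstream operation.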
