# Function composition

Given

$$g_u \circ f_v \circ \cdots \circ f_v = g_u \circ (f_v)^n$$


with the learned parameters $u$ and $v$. The derivative is

$$\frac{d g_u}{du} \circ \left( \frac{d f_v}{dv} \right)^n ~.$$


I’m interested in whether the vector of weights $v$ is updated more than once, or whether anything else takes more time than differentiating the similar expression $g_u \circ f_v$, which involves no repeated composition.

If you use a tensor more than once, every operation involving that tensor is recorded in the graph, so all of its gradient contributions will be properly accumulated. Not sure if that answers your question.
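To make the accumulation concrete, here is a minimal sketch. The toy functions $f_v(x) = v x$ and $g_u(y) = u y$ with scalar parameters are assumptions chosen for illustration, not the functions from the question. Reusing `v` in a loop records one multiplication node per iteration, and `backward()` sums the contributions into a single `v.grad`.

```python
import torch

# Hypothetical toy setup: f_v(x) = v * x and g_u(y) = u * y, scalar parameters.
u = torch.tensor(3.0, requires_grad=True)
v = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(1.5)
n = 4

y = x
for _ in range(n):
    y = v * y          # v is reused n times; each use adds a node to the graph
out = u * y            # out = u * v**n * x

out.backward()         # one backward pass; contributions to v.grad are summed

# Analytic gradients of u * v**n * x, for comparison with autograd.
expected_dv = n * u.detach() * v.detach() ** (n - 1) * x
expected_du = v.detach() ** n * x

print(v.grad, expected_dv)
print(u.grad, expected_du)
```

So `v` ends up with a single gradient (and would get a single optimizer update), but that gradient is assembled from `n` recorded uses.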

Hello @soulitzer,

I’m not yet familiar with the internals of Autograd, so please correct me if I’m wrong. You say that PyTorch maintains the computational graph whenever we implement such a calculation. My question is: in the above case of $n$ compositions, does the graph contain $n$ copies of $f$?

Our motivation is to understand whether backpropagating through the expression $f_v^n$ (i.e., applying the function $f_v$ $n$ times) is currently more expensive than backpropagating through $f_v$ (i.e., a single forward pass), even though both should update the learned weights $v$ only once.
If backpropagating through the compositions takes more time (as we suspect), that would be a bottleneck for our research, and we would therefore want to develop a solution.

Thanks!

Yes, if you just compose f n times in the forward pass, then it will appear n times in the backward graph. I’m not sure you can avoid that no matter what you do, though.
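As a sketch of why the $n$ copies show up, write the intermediate values as $x_0 = x$ and $x_k = f_v(x_{k-1})$, so the output is $y = g_u(x_n)$ (this notation is introduced here for illustration). Applying the chain rule to each use of $v$ gives one term per application of $f_v$:

$$\frac{dy}{dv} = \frac{\partial g_u}{\partial x}(x_n) \, \sum_{k=1}^{n} \left[ \frac{\partial f_v}{\partial x}(x_{n-1}) \cdots \frac{\partial f_v}{\partial x}(x_{k}) \right] \frac{\partial f_v}{\partial v}(x_{k-1}) ~.$$

Each term needs the Jacobians of $f_v$ evaluated at a different intermediate point $x_k$, so in general the backward pass does work proportional to $n$, even though the result is a single gradient for $v$.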

Why? Can’t we implement backpropagation ourselves? (That is, alter PyTorch to work with our own computational graph, one that is efficient for compositions.)
Thanks!

Oh right, I guess what I mean is that no general solution exists. That said, if you can solve the backward pass analytically for your particular graph, that could be more efficient, e.g. you could use a custom autograd Function (`torch.autograd.Function`).
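A minimal sketch of that idea, assuming a toy case where the composition has a closed form: for $f_v(x) = v x$ with scalar $v$, the $n$-fold composition is $v^n x$, so both forward and backward can be written without unrolling. The class name `ScaleNTimes` and the choice of $f_v$ are hypothetical; a real model would need its own analytic backward, if one exists at all.

```python
import torch

class ScaleNTimes(torch.autograd.Function):
    """Hypothetical example: n-fold composition of f_v(x) = v * x, i.e. v**n * x."""

    @staticmethod
    def forward(ctx, x, v, n):
        ctx.save_for_backward(x, v)
        ctx.n = n                      # non-tensor argument stored on ctx
        return v ** n * x              # closed form, a single graph node

    @staticmethod
    def backward(ctx, grad_out):
        x, v = ctx.saved_tensors
        n = ctx.n
        grad_x = grad_out * v ** n                         # d(v^n x)/dx
        grad_v = (grad_out * n * v ** (n - 1) * x).sum()   # d(v^n x)/dv, reduced to a scalar
        return grad_x, grad_v, None                        # no gradient for n

x = torch.tensor([1.0, -2.0, 0.5])
v = torch.tensor(1.5, requires_grad=True)
n = 5

# Gradient from the custom Function (one node in the backward graph).
ScaleNTimes.apply(x, v, n).sum().backward()
custom_grad = v.grad.clone()

# Reference: the unrolled loop, which records n nodes.
v.grad = None
y = x
for _ in range(n):
    y = v * y
y.sum().backward()
loop_grad = v.grad.clone()

print(custom_grad, loop_grad)
```

The two gradients should agree up to floating-point error. The saving comes entirely from knowing the closed form of $f_v^n$; for a general $f_v$ the backward would still have to visit all $n$ intermediate points.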