Automatic differentiation is really helpful, but sometimes manual differentiation can be better.
I heavily optimized my network structure using in-place operations, and I got a 40x speedup in the forward pass. The backward pass is still very slow, and it now takes about 90% of the time required by one step.
I've looked at the graph generated by .backward(), and it's sprawling! Many of the sub-graphs are identical and differ only in their input vectors (the same sub-network is invoked multiple times). I think I could get a significant speedup by collapsing all these identical graphs into one and combining the input gradients.
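For concreteness, here's a toy sketch of the pattern I mean (the `sub` module and shapes are made up, not my real network): the loop version records one copy of the sub-graph per call, while stacking the inputs into a batch records it only once and the per-input gradients fall out of the batch dimension.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
sub = nn.Linear(4, 4)  # stand-in for the shared sub-network
inputs = [torch.randn(4, requires_grad=True) for _ in range(8)]

# Loop version: autograd records 8 identical sub-graphs,
# one per call to `sub`.
loss_loop = sum(sub(x).sum() for x in inputs)
loss_loop.backward()
grads_loop = [x.grad.clone() for x in inputs]

# Batched version: one sub-graph; gradients for each input
# are recovered through the stack's batch dimension.
for x in inputs:
    x.grad = None
sub.zero_grad()
batch = torch.stack(inputs)       # shape (8, 4)
loss_batch = sub(batch).sum()
loss_batch.backward()
# inputs[i].grad now matches grads_loop[i]
```

I'd expect both versions to produce identical gradients, but with far fewer nodes in the backward graph for the batched one.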
The question is, how can I do that?