If the graph to be optimized requires gradient, runNondiffOptimization(gradient.f) will be called. runNondiffOptimization() runs several optimizations, including BatchMM and FuseGraph.
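For context, here is a minimal sketch of the kind of graph FuseGraph targets. This assumes a CUDA build (where the pointwise fuser is enabled by default) and the profiling executor; the function name chain is made up for illustration:

```python
import torch

# A chain of pointwise ops; FuseGraph can collapse these into a
# single fusion group in the optimized graph.
@torch.jit.script
def chain(x):
    return torch.relu(torch.tanh(torch.sigmoid(x)))

x = torch.randn(16, 16, device="cuda")
chain(x)  # warm-up runs so the executor can profile and optimize
chain(x)

# The pointwise ops should now appear as one fused subgraph
# (e.g. a prim::FusionGroup / prim::CudaFusionGroup node).
print(torch.jit.last_executed_optimized_graph())
```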
I wonder: why are these optimizations non-differentiable?
Take FuseGraph as an example: it fuses consecutive pointwise ops. If it can fuse f(g(x)), why can't it fuse df(g(x)) * dg(x)?
Or is it simply that the current implementation of FuseGraph does not support this?
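To make the question concrete: when f and g are pointwise, every factor of df(g(x)) * dg(x) is itself pointwise, so the backward looks just as fusible as the forward. A small check with f = tanh and g = sigmoid (the choice of functions here is only for illustration):

```python
import torch

x = torch.randn(8, requires_grad=True)

# Forward: f(g(x)) with f = tanh, g = sigmoid, both pointwise.
y = torch.tanh(torch.sigmoid(x))
y.sum().backward()

# Manual chain rule df(g(x)) * dg(x):
#   tanh'(s) = 1 - tanh(s)^2, sigmoid'(x) = s * (1 - s), s = sigmoid(x).
# Every factor is pointwise, so the backward is a pointwise chain too.
s = torch.sigmoid(x)
manual = (1 - torch.tanh(s) ** 2) * s * (1 - s)
print(torch.allclose(x.grad, manual))  # True
```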