More efficient forward() and backward()

Is it possible to make forward() and backward() more memory efficient? For example, during backpropagation we could keep only the part of the computation graph that is currently being backpropagated on the CUDA device. I think the only overhead here would be the .to(device) calls.

May I know what you mean by memory efficient?
As far as I know, PyTorch only stores the intermediate outputs needed to compute gradients for tensors with requires_grad=True. In a way, that already seems to do what you are asking for.
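For instance, a quick illustration (assuming current PyTorch behavior): autograd only builds a graph, and saves intermediates, when some input actually requires gradients:

```python
import torch

a = torch.randn(3)                      # requires_grad=False
b = torch.randn(3, requires_grad=True)

out1 = (a * a).sum()
out2 = (a * b).sum()

print(out1.grad_fn)  # None: no input needs gradients, so no graph
                     # is built and no intermediates are saved
print(out2.grad_fn)  # <SumBackward0 ...>: a graph (with its saved
                     # tensors) exists only because b requires grad
```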
If I have misunderstood your question, could you explain with a small code example, if possible?

What I mean is to move some of those intermediate outputs to CPU RAM and only bring them back when backpropagation reaches them.
For example, with z = f(x) and y = g(z): after I get z, I move the intermediate outputs of f(x) to CPU RAM and then start computing g(z). The same idea applies to backpropagation, but in reverse.
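A rough sketch of the idea (f and g are just placeholders here; if I understand correctly, newer PyTorch versions expose torch.autograd.graph.saved_tensors_hooks for exactly this kind of pack/unpack):

```python
import torch

# Placeholder f and g just to make the sketch runnable.
def f(x):
    return torch.relu(x @ x.t())

def g(z):
    return (z ** 2).mean()

def pack_to_cpu(t):
    # Forward: push each saved intermediate to CPU RAM.
    return t.to("cpu")

def unpack_to_gpu(t):
    # Backward: bring it back only when it is actually needed.
    return t.to("cuda")

x = torch.randn(2048, 2048, device="cuda", requires_grad=True)

with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_to_gpu):
    z = f(x)
    y = g(z)

y.backward()  # saved tensors are copied back to the GPU on demand
```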

This would not work well, since it requires the CPU and GPU to synchronize quite often (currently they run asynchronously), and synchronization slows things down a lot. The cost of a .to() for every intermediate output would probably outweigh the benefit of the reduced memory usage.
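If you still want to experiment with it, newer PyTorch releases ship torch.autograd.graph.save_on_cpu, which does this packing for you; with pin_memory=True the copies go through page-locked buffers and can overlap with compute to some extent, which softens (but does not remove) the slowdown:

```python
import torch

x = torch.randn(2048, 2048, device="cuda", requires_grad=True)

# Saved activations are packed into (pinned) CPU memory during forward
# and copied back to the GPU lazily during backward.
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    y = (x @ x.t()).relu().mean()

y.backward()
```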