Understanding graphs and state

  1. Yes. We don’t guarantee that the error will be raised, but if you want to be sure that you can backprop multiple times, you need to specify retain_variables=True (see the first sketch after this list). It won’t raise an error only for very simple ops like the ones you have here (e.g. the grad_input of add is just grad_output, so there’s no need for any buffers, and that’s why it also doesn’t check whether they were freed). Not sure if we should add these checks or not. It probably doesn’t matter: either it will raise a clear error, or it will still compute correct gradients.

  2. Yes, when you use the same net with the same input twice, it will construct a new graph that shares all the leaves, but all the other nodes will be exact copies of the first ones, with separate state and buffers. Every operation you do on Variables adds one more node, so if you compute a function 4 times, you’ll always have 4x more nodes around (assuming all outputs stay in scope); see the second sketch below.
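
To make point 1 concrete, here’s a minimal sketch (written against the Variable API from this release; in later versions the flag is called retain_graph and Variables are merged into Tensors):

```python
import torch
from torch.autograd import Variable

x = Variable(torch.randn(3), requires_grad=True)
y = (x * x).sum()   # mul saves its inputs, so backward buffers are allocated

y.backward(retain_variables=True)  # keep the buffers so the graph can be reused
y.backward()                       # works; without the flag above this would
                                   # complain that the buffers have been freed
```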
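
And for point 2, running the same module on the same input twice builds two graphs that share only the leaves (nn.Linear here is just a stand-in; .creator is called .grad_fn in later releases):

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

net = nn.Linear(10, 10)
x = Variable(torch.randn(1, 10))

out1 = net(x)
out2 = net(x)  # same module, same input, but a brand new graph is built

# The leaves (net.weight, net.bias, x) are shared, while the intermediate
# nodes are distinct objects with their own state and buffers.
print(out1.creator is out2.creator)  # False
```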

Now, here’s some description of when we keep the state around:

  1. When finetuning a net, all the nodes before the first operation with trained weights won’t even require the gradient, and because of that they won’t keep the buffers around. So no memory wasted in this case.

  2. Testing on non-volatile inputs and detaching the outputs will keep the bottom part of the graph around, and it will require grad because the params do, so it will keep the buffers. In both cases it would help if all the generator parameters had requires_grad set to False for a moment, or if a volatile input were used and the flag were then switched off on the generator output (see the sketch after this list). Still, I wouldn’t say that it consumes more memory on every fw-pass; it will increase the memory usage, but by a constant factor, not like a leak. The graph state will get freed as soon as the outputs go out of scope (unlike Lua, Python uses refcounting).
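
Here’s a minimal sketch of that workaround, with made-up stand-in modules: turn requires_grad off on the generator parameters for the forward pass, so no backward buffers are kept for that part of the graph, and turn it back on afterwards.

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

generator = nn.Linear(10, 10)      # stand-ins for the real models
discriminator = nn.Linear(10, 1)

x = Variable(torch.randn(1, 10))

# Flip the flag off so the generator's forward pass keeps no buffers.
for p in generator.parameters():
    p.requires_grad = False

fake = generator(x)                # fake.requires_grad is False here

for p in generator.parameters():
    p.requires_grad = True         # restore the flag for the generator's own update

# Only the discriminator part of the graph keeps state for backward.
score = discriminator(fake)
score.sum().backward()
```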

There’s however one change that we’ll be rolling out soon: variables that don’t require grad won’t keep a reference to the creator. This won’t help with inference without volatile, and it will still make the generator graph allocate the buffers, but they will be freed as soon as the output is detached. This won’t have any impact on the mem usage, since that memory would already be allocated after the first pass, and it can be reused by the discriminator afterwards.
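
For completeness, a sketch of the detach pattern mentioned above (same kind of stand-in modules):

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

generator = nn.Linear(10, 10)
discriminator = nn.Linear(10, 1)
x = Variable(torch.randn(1, 10))

fake = generator(x)
# .detach() returns a Variable that shares the data but has no reference
# to the creator, so the discriminator's backward never reaches the generator.
score = discriminator(fake.detach())
score.sum().backward()
# The generator-side graph state gets freed once `fake` goes out of scope.
```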

Anyway, the examples will need to be fixed. Hope this helps; if something’s unclear, just let me know. Also, you can read more about these flags in this note in the docs.
