Is PyTorch autograd tape-based?

The documentation (and many other places online) states that autograd is tape-based:

but Paszke, Adam, et al., “Automatic differentiation in PyTorch” (2017), clearly states:

So I guess it’s not?

I think what this means is that PyTorch and Chainer are smarter than “add everything to one giant global list”: they store a graph of operations and distinguish between paths they need to backpropagate through and paths they don’t. For example, if you fine-tune a convnet by replacing only the final layer (with the earlier layers frozen), only the part of the graph from that final layer to the loss is saved and backpropagated through.
Also, you can use the “selective recording” (my unscientific language) for neat things like differentiating implicit functions and trading compute for memory.
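Here is a small sketch of the fine-tuning case, in case it helps. The layer sizes and the names `backbone` and `head` are made up for illustration, and I am using two small linear layers as a stand-in for a real convnet. With the backbone parameters frozen, its output carries no `grad_fn`, so nothing before the head is recorded:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a pretrained network: a frozen "backbone"
# followed by a freshly initialised final layer ("head").
backbone = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
for p in backbone.parameters():
    p.requires_grad_(False)  # freeze: autograd will not track these
head = nn.Linear(16, 2)

x = torch.randn(4, 8)
features = backbone(x)   # no input requires grad, so no graph is built here
logits = head(features)  # head parameters require grad, so recording starts here
loss = logits.sum()

features_recorded = features.grad_fn is not None
logits_recorded = logits.grad_fn is not None
print(features_recorded, logits_recorded)

loss.backward()
backbone_grads = [p.grad for p in backbone.parameters()]
head_has_grads = all(p.grad is not None for p in head.parameters())
```

After `backward()`, only the head parameters have gradients; the backbone parameters keep `grad` as `None`.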
Quite likely, the nuances of no tape vs. advanced, modern interpretation of “tape” are as opaque to laypersons like me as they are clear to experts in the field, but one would perhaps expect that in 2017 you know a trick or two about great data structures and thread safety and stuff beyond what Wengert had in 1964.

Best regards



In PyTorch, there is no tape in the traditional sense. In the engine, we queue up each backward job as soon as all of its dependencies are satisfied. So it is not reversing a recorded sequence of operations, but it still executes in a topologically sorted order. This way we can easily use multiple threads to execute these tasks (if they don’t conflict with one another).
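To illustrate the scheduling idea, here is a toy sketch in plain Python. It is not the actual engine code: the function `backward_order`, the graph, and the node names are all hypothetical, and it runs on a single thread. Each node becomes ready only once every node that feeds it a gradient has finished:

```python
from collections import deque

def backward_order(graph, root):
    """Return one valid execution order for the backward jobs.

    `graph` maps each node to the nodes that receive its gradient
    (i.e. its inputs in the forward pass).
    """
    # Count how many upstream jobs each reachable node has to wait for.
    deps = {}
    stack, seen = [root], {root}
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            deps[nxt] = deps.get(nxt, 0) + 1
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)

    # Queue a job as soon as its dependency count drops to zero.
    ready = deque([root])
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in graph.get(node, ()):
            deps[nxt] -= 1
            if deps[nxt] == 0:
                ready.append(nxt)
    return order

# Backward graph for something like loss = x * w + exp(x).
graph = {
    "loss": ["add"],
    "add": ["mul", "exp"],
    "mul": ["x"],
    "exp": ["x"],
    "x": [],
}
order = backward_order(graph, "loss")
print(order)  # → ['loss', 'add', 'mul', 'exp', 'x']
```

Here `mul` and `exp` both become ready at the same moment, which is where a multithreaded engine could run them in parallel, while `x` has to wait for both.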

Is the following description appropriate?

a tape-based automatic differentiation library that supports all differentiable Tensor operations in torch