[Stupid Question] Why do you have to detach the hidden state of LSTMs, but not the hidden state of a linear network?

I’m an absolute beginner, trying to understand LSTMs.

I’ve read that if you don’t detach the hidden state of an LSTM, the graph used for the propagation of gradients gets really big. Why does this not happen with a linear classifier?

I thought LSTMs are unrolled through time, after which they form an acyclic computation graph and can be trained as usual, but apparently that’s somehow wrong. How exactly?

Also, in the current https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html they do not seem to detach anything; at least there is no call to .detach(). What’s going on here? Are they propagating the gradient through the whole sequence?

The hidden state in an LSTM is supposed to serve as “memory”. We start off with an initial hidden state, but this initial state isn’t supposed to be learned, so we detach it to let the model use those values without computing gradients w.r.t. them.

In the example you linked, it looks like they construct a fresh initial hidden state each time (something like the sketch below), so there are no concerns about those values being adjusted.
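For illustration, here’s a minimal sketch of that pattern (the layer sizes and the zero initialization are placeholder assumptions, not the tutorial’s exact code): a fresh (h_0, c_0) is built for every sequence, so the initial state is never a learned parameter.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20)

for seq in [torch.randn(5, 1, 10) for _ in range(3)]:  # three toy sequences
    # Fresh (h_0, c_0) for every sequence: plain zero tensors, not
    # nn.Parameters, so the initial state itself is never learned.
    hidden = (torch.zeros(1, 1, 20), torch.zeros(1, 1, 20))
    out, hidden = lstm(seq, hidden)
```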

Could you elaborate on what you mean by this not occurring with a linear classifier? There may be a double usage of the word “hidden” here: it can refer to the hidden layers of a neural network (the layers in between) or to the hidden state that serves as the memory of an LSTM.

Thank you for your reply.

Could you elaborate on what you mean by this not occurring with a linear classifier? There may be a double usage of the word “hidden” here: it can refer to the hidden layers of a neural network (the layers in between) or to the hidden state that serves as the memory of an LSTM.

I was wondering whether a set of weights, for example, would be connected to two autograd graphs if it appeared in two forward passes. The answer to my question seems to be that if you do y = x * 2, then x.grad_fn is None and x is connected to the autograd graph only through y, where the connection is stored in y.grad_fn. If you do a new forward pass with x, a new y will be calculated with a new grad_fn, so the autograd graphs are not connected across multiple inputs.
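A quick way to see this in PyTorch (just a toy check, the tensor shape is arbitrary):

```python
import torch

x = torch.randn(3, requires_grad=True)
print(x.grad_fn)                 # None -- x is a leaf, nothing was used to compute it

y = x * 2
print(y.grad_fn)                 # <MulBackward0 ...> -- the connection back to x

y2 = x * 2                       # a second forward pass with the same x
print(y2.grad_fn is y.grad_fn)   # False -- each result carries its own grad_fn
```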

I will try to explain what I believe I have understood, in part to further my own understanding.
I asked about the behavior of autograd in linear classifiers because I did not understand why the reasons for detaching the hidden state do not apply to linear classifiers, too.

With an LSTM you input a large sequence, but cut up into smaller parts. The same hidden state tensor is reused across the whole sequence. So if I have two small parts p and q of the sequence and have done the forward pass for p, I have a hidden state with a grad_fn that connects it to the forward pass for p.
If I use this same hidden state as the first state of q, some grad_fn in the forward pass of q will point to the hidden state, which in turn still points to the graph for p! This way the autograd graphs get connected. z = y.detach() gives a tensor with the same values as y, just with z.grad_fn set to None, so that’s how you can disconnect them.
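Here is a minimal sketch of that pattern, i.e. truncated backpropagation through time (the layer sizes, the dummy loss, and the chunk length of 25 are made up for illustration): without the detach at the end of the loop, the graph built for q would still reach back into the graph for p.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20)
long_sequence = torch.randn(100, 1, 10)                  # one long sequence
hidden = (torch.zeros(1, 1, 20), torch.zeros(1, 1, 20))  # initial (h_0, c_0)

for chunk in long_sequence.split(25):                    # the parts p, q, ...
    out, hidden = lstm(chunk, hidden)
    loss = out.sum()                                     # stand-in for a real loss
    loss.backward()
    # Without this, hidden[*].grad_fn would still point into the graph of
    # this chunk, so the next backward() would try to walk through it again.
    hidden = tuple(h.detach() for h in hidden)
```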

I’d also like to answer my other question here:

I thought LSTMs are unrolled through time, after which they form an acyclic computation graph and can be trained as usual, but apparently that’s somehow wrong. How exactly?

They are not unrolled through time explicitly. The autograd graph exists only through the grad_fn fields in tensors, so there is no global graph stored anywhere; tensors only remember where they have to send the gradients. That way one can have y = x * 2 and z = x * 3 and have (implicitly) two autograd graphs!
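For example (a toy check, not from the tutorial): both results hang off the same leaf x, but each backward() only walks its own chain of grad_fns.

```python
import torch

x = torch.ones(2, requires_grad=True)
y = (x * 2).sum()
z = (x * 3).sum()

y.backward()        # walks only the chain behind y.grad_fn
print(x.grad)       # tensor([2., 2.])

x.grad = None       # reset the accumulated gradient
z.backward()        # the y-graph is untouched; z has its own grad_fn chain
print(x.grad)       # tensor([3., 3.])
```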

It sounds to me like you have the right understanding of everything. Because x is independent, that is, not computed using anything else, it’s essentially the “start” of the autograd graph. You’re right in saying that using a new input will result in a “new” autograd graph that is not connected.

In an LSTM, the hidden state is explicitly computed using the previous part p. Therefore, once we start working on q, it uses that hidden state and so depends on p (or at least, on how we used p to compute this hidden state).

I’m still not sure what you mean by detaching the hidden state; I don’t believe we do this in the middle of a sequence, since we want to adjust how we compute this hidden state.