How PyTorch propagates gradients for an LSTM


When using an LSTM and feeding it a whole sequence in one learning step, how does backpropagation through time work?

For example, suppose I use NLLLoss and have an input sequence of 12 characters ("time steps"), like "I like torch" (embedded). I pass the LSTM outputs and the targets, which also have length 12, to the loss. NLLLoss reduces them to a single scalar, from which backprop can start. But what does the backward graph look like? How does backprop know to iterate over the 12 time steps?
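To make the setup concrete, here is a minimal sketch of the scenario described (all sizes and layer names here are hypothetical choices, not from the original post): a character-level embedding feeds a single `nn.LSTM` call covering all 12 steps, the outputs go through `log_softmax` into `NLLLoss`, and one `loss.backward()` sends gradients back through every timestep.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, vocab, embed_dim, hidden = 12, 20, 8, 16  # hypothetical sizes

embedding = nn.Embedding(vocab, embed_dim)
lstm = nn.LSTM(embed_dim, hidden)
decoder = nn.Linear(hidden, vocab)
loss_fn = nn.NLLLoss()                 # expects log-probabilities

inputs = torch.randint(0, vocab, (seq_len,))   # the 12 "time steps"
targets = torch.randint(0, vocab, (seq_len,))

x = embedding(inputs).unsqueeze(1)     # (seq_len, batch=1, embed_dim)
out, _ = lstm(x)                       # one call processes all 12 steps
logp = torch.log_softmax(decoder(out.squeeze(1)), dim=1)

loss = loss_fn(logp, targets)          # reduced to a single scalar
loss.backward()                        # backprop through all 12 steps

# Gradients reached the parameters used at every timestep.
print(embedding.weight.grad is not None, lstm.weight_ih_l0.grad is not None)
```

Even though the code never loops over timesteps explicitly, autograd records the full computation that the LSTM performed over the sequence, so the single `backward()` call covers all 12 steps.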

From what I can tell, the time steps are now handled via a for loop in C++ on the backend.
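Conceptually, that backend loop mirrors the structure of backprop through time: the forward pass iterates over timesteps, and the backward pass iterates in reverse, accumulating parameter gradients and passing a hidden-state gradient from step t+1 back to step t. A pure-Python sketch with a toy scalar RNN (a hypothetical toy model, not an LSTM, chosen only to show the looping structure):

```python
import math

# Toy scalar "RNN": h_t = tanh(w * x_t + u * h_{t-1})
# loss = sum_t (h_t - y_t)^2

def forward(w, u, xs, ys):
    h, hs = 0.0, []
    for x in xs:                        # forward loop over timesteps
        h = math.tanh(w * x + u * h)
        hs.append(h)
    loss = sum((h - y) ** 2 for h, y in zip(hs, ys))
    return loss, hs

def backward(w, u, xs, ys, hs):
    dw = du = 0.0
    dh_next = 0.0                       # gradient flowing in from step t+1
    for t in reversed(range(len(xs))):  # reversed loop: backprop through time
        dh = 2 * (hs[t] - ys[t]) + dh_next
        dpre = dh * (1 - hs[t] ** 2)    # through tanh
        dw += dpre * xs[t]              # accumulate across timesteps
        du += dpre * (hs[t - 1] if t > 0 else 0.0)
        dh_next = dpre * u              # pass gradient back to step t-1
    return dw, du

xs = [0.5, -1.0, 0.25]
ys = [0.1, 0.0, -0.2]
w, u = 0.7, 0.3
loss, hs = forward(w, u, xs, ys)
dw, du = backward(w, u, xs, ys, hs)
print(dw, du)
```

The key point is that the parameter gradients `dw` and `du` are sums over all timesteps, which is exactly why unrolling the loop (whether in Python or in a C++ kernel) is all the backward pass needs.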