Implementing Backpropagation Through Time

Hello,

I want to practice with a simple recurrent network that takes a fixed-length input sequence and produces an output sequence of the same length. To make sure I understand the concept, I made a GIF animation illustrating the forward and backward passes of backpropagation through time, shown below.


My question is: should I compute the gradients ∇_{U_i}L, ∇_{V_i}L, ∇_{W_i}L for every time step, sum them, and then update the parameters with the summed gradient, as shown in the animation? Or should I do it sequentially, that is, compute the gradient at time step 8, backpropagate to time step 7, and so on down to time step 1?
Which one is correct?
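
To state the first option precisely: here I am assuming that U_i, V_i, W_i denote the copies of the shared parameters U, V, W used at time step i, and that T = 8 is the sequence length. The summed gradient I have in mind is then:

```latex
\nabla_{U} L = \sum_{i=1}^{T} \nabla_{U_i} L, \qquad
\nabla_{V} L = \sum_{i=1}^{T} \nabla_{V_i} L, \qquad
\nabla_{W} L = \sum_{i=1}^{T} \nabla_{W_i} L
```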

This question may not be specific to PyTorch, but since I am using PyTorch to practice, I would also like to know how to obtain the gradients for each time step in PyTorch.
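
To make the question concrete, here is a minimal sketch of how I would try to read the per-time-step gradients. Everything in it is an assumption for illustration only: a vanilla tanh RNN with a squared-error loss, made-up layer sizes, and U, W, V taken to be the input-to-hidden, hidden-to-hidden, and hidden-to-output matrices. Each time step uses a `clone()` of the shared parameter with `retain_grad()`, so that after a single `backward()` on the total loss each clone's `.grad` holds that step's contribution.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes: 8 time steps as in the animation, small layers.
T, n_in, n_hid, n_out = 8, 3, 4, 2
dt = torch.double  # double precision keeps the comparison below tight

# Shared parameters (roles of U, W, V are an assumption, see above).
U = torch.randn(n_hid, n_in, dtype=dt, requires_grad=True)   # input -> hidden
W = torch.randn(n_hid, n_hid, dtype=dt, requires_grad=True)  # hidden -> hidden
V = torch.randn(n_out, n_hid, dtype=dt, requires_grad=True)  # hidden -> output

x = torch.randn(T, n_in, dtype=dt)   # fixed-length input sequence
y = torch.randn(T, n_out, dtype=dt)  # target sequence of the same length

h = torch.zeros(n_hid, dtype=dt)
loss = torch.zeros((), dtype=dt)
copies = []  # per-step copies (U_t, W_t, V_t)

for t in range(T):
    # Differentiable copies of the shared parameters for this time step.
    U_t, W_t, V_t = U.clone(), W.clone(), V.clone()
    for p in (U_t, W_t, V_t):
        p.retain_grad()  # keep .grad on these non-leaf tensors
    copies.append((U_t, W_t, V_t))

    h = torch.tanh(U_t @ x[t] + W_t @ h)
    o = V_t @ h
    loss = loss + ((o - y[t]) ** 2).sum()

# One backward pass through the whole unrolled graph.
loss.backward()

# Per-step contributions, summed over time, compared with the shared .grad.
sum_U = sum(U_t.grad for U_t, _, _ in copies)
sum_W = sum(W_t.grad for _, W_t, _ in copies)
sum_V = sum(V_t.grad for _, _, V_t in copies)

print(torch.allclose(sum_U, U.grad),
      torch.allclose(sum_W, W.grad),
      torch.allclose(sum_V, V.grad))
```

In this sketch, `copies[i][1].grad` is what I would call ∇_{W_i}L for step i + 1, and the per-step gradients summed over time match the `.grad` that autograd accumulates on the shared `U`, `W`, `V`. That is what I understand the summed version in the animation to mean.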