I noticed that in the Word Embeddings tutorial, the N-Gram Language Modelling example zeroes out the gradient with the following line of code:
```python
# Step 2. Recall that torch *accumulates* gradients. Before passing in a
# new instance, you need to zero out the gradients from the old
# instance
model.zero_grad()
```
My understanding is that this line should be `optimizer.zero_grad()`. Why is this case different?
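For context, here is a minimal sketch of how I would have expected the training step to look, assuming the usual setup where the optimizer is constructed over `model.parameters()` (the model, loss, and data below are placeholders I made up, not the tutorial's code):

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)                            # stand-in for the N-Gram model
optimizer = optim.SGD(model.parameters(), lr=0.01)  # optimizer holds exactly model.parameters()
loss_fn = nn.MSELoss()

inputs = torch.randn(4, 10)
targets = torch.randn(4, 1)

# Zero the accumulated gradients before the new instance. Since this
# optimizer was built over model.parameters(), my assumption is that
# optimizer.zero_grad() and model.zero_grad() clear the same .grad tensors here.
optimizer.zero_grad()

loss = loss_fn(model(inputs), targets)
loss.backward()
optimizer.step()
```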
Thanks