Hello

I’m training a small many-to-one LSTM network to generate characters one at a time.

My training data is split into mini-batches, and each batch has the following shape:

[batch_size, sequence_length, num_features]

with batch_first=True set on my LSTM unit.
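To make the layout concrete, here is a minimal sketch of the input shape I mean (the sizes are made up just for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical sizes, only to show the batch_first=True layout
batch_size, seq_len, num_features, hidden_size = 16, 10, 32, 64
lstm = nn.LSTM(input_size=num_features, hidden_size=hidden_size, batch_first=True)

# With batch_first=True the input is [batch, seq, feature]
x = torch.randn(batch_size, seq_len, num_features)
out, (h_n, c_n) = lstm(x)
print(out.shape)  # torch.Size([16, 10, 64])
```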

Now, after forward-feeding a mini-batch through the network and computing CrossEntropyLoss(), I call

loss.backward() just once, so the gradients get calculated and stored in the .grad attributes.

My question is:

Consider a batch size of 16. Shouldn’t the gradient be calculated 16 times (once after each sequence in the mini-batch), accumulated, and then averaged before updating the parameters with optimiser.step()?

Does loss.backward() calculate the gradients for each sequence in the mini-batch, or only once after the last sequence in the batch? If the latter, how can I calculate the gradient after each sequence in the mini-batch and then update the weights with the averaged gradients at the end?
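To show what I mean by “per-sequence gradients”, here is a toy comparison I put together (a made-up linear model instead of my LSTM, just to keep it small). If I understand CrossEntropyLoss’s default reduction='mean' correctly, one backward on the batch loss should already give the same result as backpropagating each sequence separately and averaging — is that right?

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, hypothetical, just for the comparison
batch_size, num_features, num_classes = 4, 8, 5
model = nn.Linear(num_features, num_classes)
loss_fn = nn.CrossEntropyLoss()  # default reduction='mean'

x = torch.randn(batch_size, num_features)
target = torch.randint(0, num_classes, (batch_size,))

# (a) One backward over the whole batch (what my loop does now)
model.zero_grad()
loss_fn(model(x), target).backward()
batch_grad = model.weight.grad.clone()

# (b) One backward per sequence; .grad accumulates the sums,
#     so divide by batch_size to get the average
model.zero_grad()
for i in range(batch_size):
    loss_fn(model(x[i:i + 1]), target[i:i + 1]).backward()
per_seq_grad = model.weight.grad / batch_size

print(torch.allclose(batch_grad, per_seq_grad, atol=1e-6))  # True
```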

Here is my training loop:

```
for epoch in range(num_epochs):
    for startidx in range(0, num_batches, batch_size):
        endidx = startidx + batch_size
        step = startidx // batch_size
        xbatch = xhot_seq[startidx:endidx]
        ybatch = yhot_seq[startidx:endidx]
        # Initialise hidden state
        hidden = model.init_hidden()
        # Clear stored gradients
        model.zero_grad()
        # Forward pass
        y_pred, hidden = model(xbatch, hidden)
        target = torch.argmax(ybatch.long(), dim=1)
        loss = loss_fn(y_pred, target)
        loss_hist[epoch] = loss.item()
        # Backward pass
        loss.backward()
        # Update parameters
        optimiser.step()
```

Thanks