I’m training a small many-to-one LSTM network to generate characters one at a time.
My training data is split into mini-batches, each batch with the shape
[batch_size, sequence_length, num_features]
and batch_first=True in my LSTM unit.
Now, after feeding a mini-batch forward through the network and computing CrossEntropyLoss(), I call
loss.backward() just once, so the gradients get calculated and stored in the .grad attributes.
My questions are:
Considering a batch size of 16, shouldn’t the gradient be calculated 16 times (once after each sequence in the mini-batch), accumulated, and then averaged before updating the parameters with optimiser.step()?
Does loss.backward() calculate the gradients for each sequence in the mini-batch, or just once after the last sequence in the batch? If the latter, how can I calculate the gradient after each sequence in the mini-batch and then update the weights with the averaged gradients at the end?
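To make the question concrete, here is a minimal sketch of the two alternatives I’m asking about (the `CharLSTM` module and all sizes are made up for illustration, not my real model): one backward over the whole mini-batch versus one backward per sequence, with the accumulated gradients divided by the batch size at the end.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, purely illustrative
n_features, hidden_size, n_classes = 8, 16, 8
batch_size, seq_len = 4, 5

class CharLSTM(nn.Module):
    """Toy many-to-one LSTM classifier (hypothetical stand-in)."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, n_classes)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])  # scores from the last time step

model = CharLSTM()
loss_fn = nn.CrossEntropyLoss()  # default reduction='mean'

x = torch.randn(batch_size, seq_len, n_features)
y = torch.randint(0, n_classes, (batch_size,))

# (a) one backward over the whole mini-batch
model.zero_grad()
loss_fn(model(x), y).backward()
grads_batched = [p.grad.clone() for p in model.parameters()]

# (b) one backward per sequence; backward() accumulates into .grad,
#     so dividing by batch_size afterwards gives the average gradient
model.zero_grad()
for i in range(batch_size):
    loss_fn(model(x[i:i+1]), y[i:i+1]).backward()
for p in model.parameters():
    p.grad /= batch_size
grads_per_seq = [p.grad.clone() for p in model.parameters()]

# With reduction='mean' I would expect these to agree numerically
max_diff = max((a - b).abs().max().item()
               for a, b in zip(grads_batched, grads_per_seq))
print(max_diff)
```

If I understand the mean reduction correctly, (a) and (b) should produce (nearly) the same gradients, which is exactly what I’d like confirmed.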
Here is my training loop:
```python
for epoch in range(num_epochs):
    for startidx in range(0, num_batches, batch_size):
        endidx = startidx + batch_size
        step = startidx // batch_size
        xbatch = xhot_seq[startidx:endidx]
        ybatch = yhot_seq[startidx:endidx]

        # Forward pass
        # Initialise hidden state
        hidden = model.init_hidden()
        # Clear stored gradients
        model.zero_grad()
        y_pred, hidden = model(xbatch, hidden)
        target = torch.argmax(ybatch.long(), dim=1)
        loss = loss_fn(y_pred, target)
        loss_hist[epoch] = loss.item()

        # Backward pass
        loss.backward()

        # Update parameters
        optimiser.step()
```