Hello

I’m training a small many-to-one LSTM network to generate characters one at a time.

My training data is split into mini-batches, and each batch has the following shape:

[batch_size, sequence_length, num_features]

with batch_first=True set on my LSTM unit.
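To make the layout concrete, here is a minimal sketch of the input shape I mean (the sizes are made up just for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical sizes, only to show the batch_first=True layout
batch_size, seq_len, num_features, hidden_size = 16, 10, 32, 64
lstm = nn.LSTM(input_size=num_features, hidden_size=hidden_size, batch_first=True)

# With batch_first=True the input is [batch, seq, feature]
x = torch.randn(batch_size, seq_len, num_features)
out, (h_n, c_n) = lstm(x)
print(out.shape)  # torch.Size([16, 10, 64])
```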

Now, after forward-feeding a mini-batch through the network and computing CrossEntropyLoss(), I call

loss.backward() just once, so the gradients get calculated and stored in the .grad attributes.

My question is:

Consider a batch size of 16. Shouldn’t the gradient be calculated 16 times (once after each sequence in the mini-batch), accumulated, and then averaged before updating the parameters with optimiser.step()?

Does loss.backward() calculate the gradients for each sequence in the mini-batch, or only once after the last sequence in the batch? If the latter, how can I calculate the gradient after each sequence in the mini-batch and then update the weights with the averaged gradients at the end?
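To show what I mean by “per-sequence gradients”, here is a toy comparison I put together (a made-up linear model instead of my LSTM, just to keep it small). If I understand CrossEntropyLoss’s default reduction='mean' correctly, one backward on the batch loss should already give the same result as backpropagating each sequence separately and averaging — is that right?

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, hypothetical, just for the comparison
batch_size, num_features, num_classes = 4, 8, 5
model = nn.Linear(num_features, num_classes)
loss_fn = nn.CrossEntropyLoss()  # default reduction='mean'

x = torch.randn(batch_size, num_features)
target = torch.randint(0, num_classes, (batch_size,))

# (a) One backward over the whole batch (what my loop does now)
model.zero_grad()
loss_fn(model(x), target).backward()
batch_grad = model.weight.grad.clone()

# (b) One backward per sequence; .grad accumulates the sums,
#     so divide by batch_size to get the average
model.zero_grad()
for i in range(batch_size):
    loss_fn(model(x[i:i + 1]), target[i:i + 1]).backward()
per_seq_grad = model.weight.grad / batch_size

print(torch.allclose(batch_grad, per_seq_grad, atol=1e-6))  # True
```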

Here is my training loop:

```
for epoch in range(num_epochs):
    for startidx in range(0, num_batches, batch_size):
        endidx = startidx + batch_size
        step = startidx // batch_size
        xbatch = xhot_seq[startidx:endidx]
        ybatch = yhot_seq[startidx:endidx]
        # Initialise hidden state
        hidden = model.init_hidden()
        # Clear stored gradients
        model.zero_grad()
        # Forward pass
        y_pred, hidden = model(xbatch, hidden)
        target = torch.argmax(ybatch.long(), dim=1)
        loss = loss_fn(y_pred, target)
        loss_hist[epoch] = loss.item()
        # Backward pass
        loss.backward()
        # Update parameters
        optimiser.step()
```

Thanks