I’m training an encoder-decoder model with two GRUs. The training algorithm looks something like this:
```python
criterion = NLLLoss(...)
opt = Adam(...)

for epoch in range(epochs):
    opt.zero_grad()
    for batch in split_to_batches(dataset, batch_size):
        ...  # prepare data
        predicted = model(batch_encoder_inputs, batch_decoder_inputs)
        loss = criterion(predicted, batch_targets)
        loss.backward()
    opt.step()
```
So the gradients are accumulated across all batches (each `backward()` adds into `param.grad`), and the weights are updated once per epoch.
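To spell out what that accumulation does numerically, here is a toy plain-Python sketch with hypothetical gradient values: each `backward()` call adds the batch gradient into the same buffer, and the single `step()` applies the accumulated sum.

```python
# Toy illustration of gradient accumulation: in PyTorch, loss.backward()
# *adds* into param.grad, so without zero_grad() between batches a single
# opt.step() applies the sum of all per-batch gradients.
batch_grads = [0.5, -0.2, 0.3]   # hypothetical per-batch gradients
lr = 0.1
w = 1.0

accumulated = 0.0
for g in batch_grads:            # plays the role of loss.backward() per batch
    accumulated += g
w -= lr * accumulated            # the single opt.step() per epoch
# w is now 1.0 - 0.1 * 0.6, i.e. approximately 0.94
```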
When I trained the network with batch_size = 200 it was fine: the loss decreased at a reasonable rate. But when I increased batch_size to 400, it became really hard to get the loss to decrease at all. Only when I set the learning rate to something like 0.0001 did it start to decrease, but very slowly. I also tried gradient clipping (max norms from 4 to 8) right before opt.step, with no luck.
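For reference, the clipping I mean rescales the whole gradient vector when its L2 norm exceeds a threshold (in PyTorch this is `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)`, called between `loss.backward()` and `opt.step()`). A minimal plain-Python sketch of the idea, with made-up gradient values:

```python
import math

def clip_grad_norm(grads, max_norm):
    """Return the gradients rescaled so their global L2 norm is at most
    max_norm; the direction is preserved (same idea as
    torch.nn.utils.clip_grad_norm_)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

grads = [3.0, 4.0]                          # L2 norm = 5.0
clipped = clip_grad_norm(grads, max_norm=4.0)
# clipped has norm 4.0, same direction as grads
```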
Why is this happening? I thought a larger batch_size should improve training performance. Am I wrong? How can I fix the training algorithm?
Thanks in advance!