I have a training run that works well with a batch size of 1 but gets “shaky” when running with a bigger batch.
For more context, it is a sequential audio-to-motion model (somewhat similar to an NLP transformer model).
What are things that could be tried to “stabilize” the training?
A bigger learning rate? Some batch normalization? An even bigger batch?
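In case it helps frame the question, here is a minimal sketch of two of the candidates above, assuming PyTorch (the framework isn't stated in my post): the linear learning-rate scaling heuristic for larger batches, and a normalization layer. Note that transformer-style sequence models typically use `LayerNorm` rather than batch normalization, since `LayerNorm` normalizes over features and is independent of batch size. All dimensions below (feature size 64, sequence length 100) are made up for illustration.

```python
import torch
import torch.nn as nn

# Linear LR scaling heuristic: when the batch size grows from B0 to B,
# a common rule of thumb is to scale the learning rate by B / B0.
base_lr = 1e-4      # hypothetical LR that worked at batch size 1
base_batch = 1
batch_size = 32
scaled_lr = base_lr * batch_size / base_batch

# Normalization: LayerNorm normalizes each timestep over the feature
# dimension, so its behavior does not change with batch size (unlike
# BatchNorm, whose running statistics depend on the batch).
feature_dim = 64                       # hypothetical feature size
norm = nn.LayerNorm(feature_dim)
x = torch.randn(batch_size, 100, feature_dim)  # (batch, time, features)
y = norm(x)                            # shape is preserved
```

This is only a sketch of what I mean by “bigger learning rate” and “some batch normalization”, not something I have verified fixes the instability.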
Any ideas are welcome, thanks in advance.