You could try accumulating the gradients over several iterations, using @albanD’s suggestions posted here, and thus artificially create a larger effective batch size without needing more memory per forward pass. This might help your model converge.
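
In case it helps, here is a minimal sketch of what gradient accumulation could look like. It is not necessarily identical to the linked suggestions; the toy `nn.Linear` model, the random data, the helper name `train_with_accumulation`, and the numbers (micro-batches of 4 samples, 8 accumulation steps) are all assumptions made up for illustration. The idea is to call `backward()` on several small batches so the gradients add up in `.grad`, and only then call `optimizer.step()`.

```python
import copy

import torch
import torch.nn as nn


def train_with_accumulation(model, criterion, optimizer, batches, accumulation_steps):
    """Hypothetical helper: one pass over `batches`, stepping every `accumulation_steps` micro-batches."""
    optimizer.zero_grad()
    num_updates = 0
    for i, (x, y) in enumerate(batches):
        loss = criterion(model(x), y)
        # Divide by the number of accumulated steps so the summed gradients
        # match the gradient of the mean loss over the larger batch.
        (loss / accumulation_steps).backward()  # gradients are added to .grad
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()       # one update per `accumulation_steps` micro-batches
            optimizer.zero_grad()  # reset before the next accumulation window
            num_updates += 1
    return num_updates


torch.manual_seed(0)

# Toy setup (assumed for illustration only).
model = nn.Linear(10, 1)
reference = copy.deepcopy(model)  # untouched copy, handy for comparing against a full-batch step
criterion = nn.MSELoss()
data, target = torch.randn(32, 10), torch.randn(32, 1)

# 8 micro-batches of 4 samples each, i.e. an effective batch size of 32.
batches = list(zip(data.split(4), target.split(4)))

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
num_updates = train_with_accumulation(
    model, criterion, optimizer, batches, accumulation_steps=8
)
```

With a mean-reduced loss and equally sized micro-batches, this should give the same update as a single batch of 32 samples, up to floating point error. Note that layers which depend on batch statistics, such as batch norm, still only see the small micro-batches, so the equivalence is not exact for every model.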