Reduce training time with a large sample size and a small batch size

I have a dataset of about 15 million samples, and I have noticed that the neural network I am training converges best with a relatively small batch size of 128. The problem is that with this small batch size, a single pass through the training set takes a very long time, even though the model only occupies about a quarter of my GPU's memory and GPU utilization sits around 30%. Is it possible to reduce the training time through some form of parallelism? Common data parallelism seems to split one large batch across multiple GPUs, which is not what I am looking for. Many thanks!
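
For concreteness, here is a simplified sketch of the kind of loop I am running (PyTorch here; the model and data below are placeholders, not my real network or dataset, but the structure is the same):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda")

# Stand-in data; my real dataset has ~15 million samples.
X = torch.randn(100_000, 32)
y = torch.randint(0, 2, (100_000,))
dataset = TensorDataset(X, y)

# Batch size 128 is what converges best for me.
loader = DataLoader(dataset, batch_size=128, shuffle=True,
                    num_workers=4, pin_memory=True)

# Placeholder model; my real network is larger but still
# uses only ~25% of GPU memory at this batch size.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                      nn.Linear(64, 2)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for xb, yb in loader:
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
    optimizer.zero_grad()
    loss = criterion(model(xb), yb)
    loss.backward()
    optimizer.step()
```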