I am currently using minibatching for the first time with PyTorch. It was quite a mess to implement with padding (it's an NLP system that feeds sentences word by word into RNNs), and now that it's done, I wonder whether the fact that learning on my CPU gets slower as batch size increases is due to my poor implementation (and would therefore carry over to GPU) or due to how minibatches (and their matrix multiplications) interact with CPUs.
I can currently NOT test what happens if I run the network on a GPU.
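For context, here is a minimal sketch of the kind of padded batching setup described above. All names, sizes, and the choice of GRU are illustrative, not my actual code; the point is the `pad_sequence` / `pack_padded_sequence` pattern, which lets the RNN skip padded positions:

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Three sentences of different lengths, already mapped to word indices
# (index 0 is reserved for padding).
sentences = [torch.tensor([1, 2, 3, 4]),
             torch.tensor([5, 6]),
             torch.tensor([7, 8, 9])]
lengths = torch.tensor([len(s) for s in sentences])

# Pad into a single (batch, max_len) tensor.
padded = pad_sequence(sentences, batch_first=True, padding_value=0)

embed = torch.nn.Embedding(num_embeddings=10, embedding_dim=8, padding_idx=0)
rnn = torch.nn.GRU(input_size=8, hidden_size=16, batch_first=True)

# Pack so the RNN does not waste computation on padding.
packed = pack_padded_sequence(embed(padded), lengths,
                              batch_first=True, enforce_sorted=False)
_, hidden = rnn(packed)
print(hidden.shape)  # (num_layers * num_directions, batch, hidden) = (1, 3, 16)
```

If the slowdown comes from padding overhead rather than packing, timing this pattern against a version that feeds the padded tensor directly to the RNN may help isolate it.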