Is minibatching on CPU supposed to be slower with greater batch size?

I am currently using minibatching for the first time with PyTorch. It was quite a mess to implement with padding (it's an NLP system that feeds sentences word by word into RNNs), and now that it's done, I wonder whether the fact that learning on my CPU gets slower as the batch size increases is due to my poor implementation (and would therefore carry over to the GPU) or due to how minibatches (and their matrix multiplications) interact with CPUs.

I currently can NOT test what happens when running the network on a GPU.

Actually, it could be the implementation. If you're looping over the batch, the CPU is not doing the work in parallel. The batch should always be a dimension of the tensors that are multiplied, as in the sketch below.
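For illustration, here is a minimal sketch (a toy GRU with made-up sizes and four random "sentences") of the difference between looping over the batch and giving the RNN a real batch dimension via padding/packing:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Hypothetical toy setup: 4 sentences of different lengths, embedding dim 8.
rnn = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
sentences = [torch.randn(n, 8) for n in (5, 3, 7, 2)]  # each (seq_len, emb_dim)

# Slow pattern: looping over the batch calls the RNN once per sentence,
# so the CPU never sees a single batched matrix multiply.
outputs_loop = [rnn(s.unsqueeze(0))[0] for s in sentences]

# Batched pattern: pad to a common length so the batch is a real tensor
# dimension, then (optionally) pack so the padded steps are skipped.
lengths = torch.tensor([s.size(0) for s in sentences])
padded = pad_sequence(sentences, batch_first=True)   # (batch, max_len, emb_dim)
packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=False)
outputs_batched, hidden = rnn(packed)
```

With the batched version, each time step is one matrix multiplication over the whole batch instead of one per sentence.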

On a GPU the implementation can behave differently, because the kernels run in parallel across the batch. Larger batches are typically faster per sample on a GPU, depending on the specs.
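If you do get GPU access later, a rough timing sketch like this (hypothetical layer and batch sizes) can show how throughput scales with batch size on either device:

```python
import time
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
rnn = nn.GRU(input_size=8, hidden_size=16, batch_first=True).to(device)

for batch_size in (1, 8, 64):
    x = torch.randn(batch_size, 50, 8, device=device)  # (batch, seq_len, emb_dim)
    if device.type == "cuda":
        torch.cuda.synchronize()  # make sure timing covers the actual GPU work
    start = time.perf_counter()
    for _ in range(20):
        rnn(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    print(f"batch={batch_size}: {time.perf_counter() - start:.3f}s")
```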