Batch normalization inference performance on CPU

I am using the CPU for inference because it's faster in my case, probably because my model has few weights and my CPU is much stronger than my GPU.
The problem:
A batch normalization layer takes about twice as long as a fully connected layer, which I don't think should be the case. On the GPU, the batch normalization layers are slightly faster than the fully connected layers. Is there anything I can do to improve performance on the CPU, particularly for the batch normalization layers?
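For context on why this cost seems avoidable: at inference time, batch normalization uses fixed running statistics, so it reduces to a per-feature affine transform and can in principle be folded into the preceding fully connected layer's weights, eliminating the BN layer entirely. A minimal NumPy sketch of that folding (function and variable names are mine, not from any particular framework):

```python
import numpy as np

def fold_bn_into_fc(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold an inference-mode batch norm (fixed running stats) into the
    preceding fully connected layer, so the BN layer disappears.
    W has shape (out_features, in_features); all others have shape (out_features,)."""
    scale = gamma / np.sqrt(var + eps)      # per-output-feature BN scale
    W_folded = W * scale[:, None]           # absorb scale into the weights
    b_folded = (b - mean) * scale + beta    # absorb shift into the bias
    return W_folded, b_folded

# Sanity check: FC followed by BN matches the single folded FC.
rng = np.random.default_rng(0)
out_f, in_f = 4, 3
W = rng.standard_normal((out_f, in_f))
b = rng.standard_normal(out_f)
gamma = rng.standard_normal(out_f)
beta = rng.standard_normal(out_f)
mean = rng.standard_normal(out_f)
var = rng.random(out_f) + 0.1               # keep variances positive

x = rng.standard_normal(in_f)
fc_then_bn = gamma * ((W @ x + b) - mean) / np.sqrt(var + 1e-5) + beta
W_f, b_f = fold_bn_into_fc(W, b, gamma, beta, mean, var)
folded = W_f @ x + b_f
assert np.allclose(fc_then_bn, folded)
```

Many frameworks can do this fusion automatically during graph optimization or export, which may explain why the GPU path doesn't show the same overhead.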