Training a model on GPU is slower than on CPU

Hello, I'm new to PyTorch and having some problems understanding why my model takes longer to train on GPU than on CPU. I'm running the experiments on a server with an Intel® Xeon® CPU E5-2680 v3 @ 2.50GHz and 2x Tesla K40m. When training the model on the CPU it takes approximately 5s to compute loss.backward() and optimizer.step(), while when using 1 GPU it takes ~20s. I've tried to profile those two specific lines with the built-in profiler. When reading the results I can see that the CUDA time and CUDA total time are 0, which suggests that the backprop is not being performed on the GPU.
Have any of you encountered something similar, or do you have any ideas where the problem might be (or what I am doing wrong)?