Hello, I’m new to PyTorch and am having trouble understanding why my model takes longer to train on GPU than on CPU. I’m running the experiments on a server with an Intel® Xeon® CPU E5-2680 v3 @ 2.50GHz and 2x Tesla K40m. When training the model on the CPU, it takes approximately 5s to compute loss.backward() and optimizer.step(), while with 1 GPU it takes ~20s. I’ve tried to profile those two specific calls with the built-in profiler. In the results, the CUDA time and CUDA total time are both 0, which suggests the backprop is not being performed on the GPU.
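Here is roughly how I’m profiling those two calls (a minimal sketch; the model, optimizer, and data below are just placeholders, my real training loop is larger):

```python
import torch
import torch.nn as nn
from torch.autograd import profiler

# Placeholder model/data just to illustrate the profiling setup
device = torch.device("cuda:0")
model = nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(64, 1024, device=device)
target = torch.randn(64, 1024, device=device)

loss = nn.functional.mse_loss(model(x), target)

# use_cuda=True so the profiler records CUDA kernel times as well
with profiler.profile(use_cuda=True) as prof:
    loss.backward()
    optimizer.step()

print(prof.key_averages().table(sort_by="cuda_time_total"))
```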
Has any of you encountered something similar, or do you have any ideas where the problem might be (or what I am doing wrong)?