Training slower on GPU than on CPU

CherubimHD · June 1, 2021, 5:04pm

Hi,

I have a regular DDQN algorithm with a custom environment. Interestingly, the training is slower on the cloud machine I have access to through my university than on my own laptop. The cloud machine has a GPU and my laptop does not.
At first I thought this might be due to constantly moving the newest state from the environment to the GPU in order to run the model on it. However, I have now completely rewritten the environment to use torch and run basically all computations directly on the GPU, so there is virtually no place left where I have to move a tensor from CPU to GPU. (I searched my whole project for tensor transfers and found none, since I also initialize most tensors directly on the GPU.) When I check the device of the most important tensors in my code, they are all on 'cuda'.

I really don't know where the bottleneck of my code on the cloud machine could be. Is it possible that GPU execution is slower than CPU execution because during simulation I basically have a batch size of 1 (and only have larger batch sizes during replay)?
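
One way to check the batch-size hypothesis is to time the forward pass alone on each device. The snippet below is only a rough sketch: the network shape (8 inputs, 64 hidden units, 4 actions) and the helper names `make_net` and `time_forward` are made up for illustration and are not taken from my actual project. It calls `torch.cuda.synchronize()` before reading the clock because CUDA kernels are launched asynchronously.

```python
import time

import torch
import torch.nn as nn


def make_net(obs_dim=8, hidden=64, n_actions=4):
    # Hypothetical small Q-network; the sizes are assumptions for illustration.
    return nn.Sequential(
        nn.Linear(obs_dim, hidden),
        nn.ReLU(),
        nn.Linear(hidden, n_actions),
    )


def time_forward(model, x, n_iters=200, n_warmup=10):
    """Return the average number of seconds per forward pass on x's device."""
    on_cuda = x.device.type == "cuda"
    with torch.no_grad():
        for _ in range(n_warmup):  # warm-up runs are excluded from the timing
            model(x)
        if on_cuda:
            torch.cuda.synchronize()  # wait for queued kernels before timing
        start = time.perf_counter()
        for _ in range(n_iters):
            model(x)
        if on_cuda:
            torch.cuda.synchronize()  # make sure all timed kernels finished
        elapsed = time.perf_counter() - start
    return elapsed / n_iters


devices = ["cpu"] + (["cuda"] if torch.cuda.is_available() else [])
results = {}
for dev in devices:
    net = make_net().to(dev)
    for batch_size in (1, 256):
        x = torch.randn(batch_size, 8, device=dev)
        results[(dev, batch_size)] = time_forward(net, x)
        print(f"{dev:>4} batch={batch_size:<4} "
              f"{results[(dev, batch_size)] * 1e6:.1f} us per forward pass")
```

If the GPU time per call barely changes between batch size 1 and 256 while the CPU time at batch size 1 is lower, the simulation loop is probably dominated by per-call overhead rather than by actual computation.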

I'm grateful for any suggestions!

PS: I can't really use the torch performance profiler, since it would require building from source on the cloud machine and I really don't want to do that.

EDIT: The difference in speed is quite significant as training on my laptop is twice as fast.

ptrblck · June 2, 2021, 5:14am

I don’t think you need to rebuild PyTorch from source to use the profiler (via Kineto), but I might be wrong. In any case, profiling with e.g. Nsight Systems also works with the PyTorch binaries and would help isolate the bottleneck.
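
As a rough sketch of what the built-in profiler looks like, assuming a PyTorch release that ships the `torch.profiler` module: the tiny model and the input shape below are placeholders, not the code from this thread.

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile

# Placeholder model and input; swap in the real network and a real state tensor.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4)).to(device)
x = torch.randn(1, 8, device=device)  # batch size 1, as during simulation

# Always record CPU-side activity; add CUDA kernels when a GPU is present.
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    with torch.no_grad():
        for _ in range(100):
            model(x)

# Summarize the recorded operators, sorted by total CPU time.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(table)
```

For a small model run with batch size 1, comparing the CPU time column against the CUDA time column in this table gives a first hint of whether the time goes into launching operations or into the GPU kernels themselves.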

raharth · June 22, 2021, 5:28pm

I have seen a similar problem when using very small environments and NNs. In my case it was a 10 node(?) model with a single hidden layer.
It was partly due to the swapping, but my model was also so small that it could not benefit from the GPU's parallel execution and instead suffered from its much lower clock rate.

So if your model is REALLY small, this might help explain it.