I have a regular DDQN algorithm with a custom environment. Interestingly, training is slower on the cloud machine I have access to through my university than on my own laptop, even though the cloud machine has a GPU and my laptop does not.
At first I thought this might be due to constantly moving the newest state from the environment to the GPU in order to run the model on it. However, I have now completely rewritten the environment in torch so that essentially all computation happens directly on the GPU; there is virtually no place left where a tensor is moved from CPU to GPU (I searched the whole project for transfer operations and found none, and I also initialize most tensors directly on the GPU). When I check the device of the most important tensors in my code, they are all on 'cuda'.
I really don't know where the bottleneck on the cloud machine could be. Is it possible that GPU execution is slower than CPU execution because during simulation I effectively have a batch size of 1 (only during replay do I use larger batch sizes)?
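To test the batch-size-1 hypothesis directly, here is a minimal benchmark sketch. The small MLP is a hypothetical stand-in for your Q-network (not your actual model), and the sizes are assumptions; the point is only to compare many single-state forward passes on CPU vs. GPU:

```python
import time
import torch

def time_forward(model, x, n_iters=200):
    """Time n_iters forward passes; synchronize around GPU work so we
    measure kernel execution, not just asynchronous launch."""
    with torch.no_grad():
        for _ in range(10):          # warm-up iterations
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize()
    return time.perf_counter() - start

# Hypothetical stand-in for a DQN head: 64-dim state, 4 actions.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 4),
)
x_cpu = torch.randn(1, 64)           # batch size 1, as during simulation
cpu_t = time_forward(model, x_cpu)
print(f"CPU, batch=1: {cpu_t:.4f}s")

if torch.cuda.is_available():
    model_gpu = model.cuda()
    x_gpu = x_cpu.cuda()
    gpu_t = time_forward(model_gpu, x_gpu)
    print(f"GPU, batch=1: {gpu_t:.4f}s")
```

For a network this small at batch size 1, it would not be surprising if the CPU wins: each GPU forward pass pays a fixed kernel-launch overhead that the tiny amount of actual compute cannot amortize.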
I'm grateful for any suggestions!
PS: I can't really use the torch performance profiler, since that would require a build from source on the cloud machine and I really don't want to do this.
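Even without the profiler, individual steps can be timed by hand. Below is a small sketch of a helper (the name `timed` is my own) that uses `torch.cuda.Event` for GPU work, since GPU kernels run asynchronously and naive wall-clock timing around an unsynchronized call mostly measures the launch, not the execution:

```python
import time
import torch

def timed(fn, *args):
    """Run fn(*args) once and return (result, elapsed milliseconds).
    Uses CUDA events when any input tensor is on the GPU, else wall clock."""
    use_cuda = any(isinstance(a, torch.Tensor) and a.is_cuda for a in args)
    if use_cuda:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        out = fn(*args)
        end.record()
        torch.cuda.synchronize()     # wait for the recorded work to finish
        return out, start.elapsed_time(end)
    t0 = time.perf_counter()
    out = fn(*args)
    return out, (time.perf_counter() - t0) * 1000.0
```

Wrapping the main pieces of a training step (environment step, action selection, replay update) with something like `out, ms = timed(model, state)` should reveal where the cloud machine is losing time.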
EDIT: The difference in speed is quite significant: training on my laptop is about twice as fast.