Inner loops on GPUs vs CPUs

As part of the model's architecture, the training step contains an inner for loop that iterates over the steps of a single sequence, one step at a time.
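Roughly, the pattern is this (a simplified, runnable sketch; the stand-in op and the shapes are placeholders, not my actual model):

```python
import torch

seq_len, batch, dim = 50, 32, 128
sequence = torch.randn(seq_len, batch, dim)
hidden = torch.zeros(batch, dim)

for t in range(seq_len):
    x_t = sequence[t]                  # one time step of the sequence
    hidden = torch.tanh(x_t + hidden)  # stand-in for the per-step model call
```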

Profiling this part on both CPU and GPU, I noticed a big difference in the time it takes to iterate over the sequence: on CPU it takes 0.2-0.4 s, while on GPU every single iteration takes around 0.65 s. This is slowing down the overall execution of the code on the Slurm cluster (I am using DistributedDataParallel).
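For what it's worth, CUDA kernels launch asynchronously, so timing a loop like this on GPU needs a `torch.cuda.synchronize()` before reading the clock, otherwise the wall-clock numbers can be misleading. A minimal sketch of how I would time it (the step itself is again a stand-in):

```python
import time
import torch

device = torch.device("cuda")
seq_len, batch, dim = 50, 32, 128
sequence = torch.randn(seq_len, batch, dim, device=device)
hidden = torch.zeros(batch, dim, device=device)

# Synchronize so queued kernels finish before the clock is read;
# without this the timing may not reflect the actual GPU work.
torch.cuda.synchronize()
start = time.time()
for t in range(seq_len):
    hidden = torch.tanh(sequence[t] + hidden)  # stand-in for the real step
torch.cuda.synchronize()
print(f"loop took {time.time() - start:.3f} s")
```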

As far as I know, transferring data from GPU back to CPU mid-training is not worth the overhead, so I am not sure what should be done in this case. I tried both the nccl and gloo backends, but both are slow.
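For context, the backend switch was just the string passed at initialization (this sketch assumes the usual env:// setup that torchrun or Slurm provides the environment variables for):

```python
import torch.distributed as dist

# The only change between the two runs was the backend string:
dist.init_process_group(backend="nccl")   # GPU run
# dist.init_process_group(backend="gloo")  # CPU run
```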

I also made a naive mistake: inside the for loop I called .to(device) on each data point separately, transferring them to the GPU one at a time instead of moving the whole sequence at once, which I believe contributed to the overhead.
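The before/after looks roughly like this (names and shapes are placeholders for my actual code):

```python
import torch

device = torch.device("cuda")
seq_len, batch, dim = 50, 32, 128
sequence = torch.randn(seq_len, batch, dim)  # starts on the CPU
hidden = torch.zeros(batch, dim, device=device)

# What I was doing: one host-to-device copy per iteration.
for t in range(seq_len):
    x_t = sequence[t].to(device)               # separate transfer every step
    hidden = torch.tanh(x_t + hidden)          # stand-in for the model call

# What it should be: a single transfer for the whole sequence up front.
# (non_blocking=True only helps if the source tensor is in pinned memory.)
sequence = sequence.to(device, non_blocking=True)
for t in range(seq_len):
    hidden = torch.tanh(sequence[t] + hidden)
```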