As part of the model’s architecture, the training step contains an inner for loop that iterates over the steps of a single sequence, one step at a time.
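For concreteness, here is a minimal sketch of what such a per-step inner loop looks like (simplified; the cell type, sizes, and variable names are illustrative, not the actual model):

```python
import torch
import torch.nn as nn

# Simplified stand-in for the model's recurrent step (illustrative only).
input_size, hidden_size, seq_len, batch = 8, 16, 50, 4
cell = nn.GRUCell(input_size, hidden_size)
x = torch.randn(seq_len, batch, input_size)   # one batch of sequences, (T, B, F)
h = torch.zeros(batch, hidden_size)

outputs = []
for t in range(seq_len):          # one step at a time -- this is the slow loop
    h = cell(x[t], h)             # each call launches several small kernels on GPU
    outputs.append(h)
out = torch.stack(outputs)        # (T, B, H)
print(out.shape)                  # torch.Size([50, 4, 16])
```

Each iteration does a small amount of work, which is why kernel-launch overhead can dominate on the GPU.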
Profiling this part on both CPU and GPU, I noticed a big difference in per-iteration time: on CPU each iteration takes 0.2–0.4 s, while on GPU every single iteration takes around 0.65 s. This is slowing down the overall execution of the code on the Slurm cluster (I am using DistributedDataParallel).
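A sketch of how such a per-iteration measurement can be taken (a simplification, not my exact profiling code): CUDA kernels launch asynchronously, so the device has to be synchronized before reading the clock, or the time gets attributed to the wrong iteration.

```python
import time
import torch

def timed_step(fn, *args):
    """Time one call of fn, synchronizing so pending asynchronous
    CUDA work does not skew the measurement."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return result, time.perf_counter() - start

# Toy example: a small matmul stands in for one sequence step.
a = torch.randn(256, 256)
result, elapsed = timed_step(torch.mm, a, a)
print(f"step took {elapsed:.6f}s")
```

The same pattern works inside the real inner loop by wrapping the per-step call.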
As far as I know, transferring data from GPU to CPU mid-training is not worth the overhead, so I am not sure what should be done in this case. I have tried both the nccl and gloo backends, but both are equally slow.