DataLoader is slower if the model is on the GPU

I run ffmpeg (to preprocess data) as a subprocess from the DataLoader workers (num_workers=6). There is nothing GPU-related in the data loading, just a subprocess.run call to ffmpeg.
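Roughly, the dataset looks like this (the class name, file format, and ffmpeg flags are simplified placeholders, not my exact code):

```python
import subprocess

import numpy as np
import torch
from torch.utils.data import Dataset


class FfmpegDataset(Dataset):
    """Decodes each file with ffmpeg inside the DataLoader worker."""

    def __init__(self, files):
        self.files = files

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Decode/resample with ffmpeg; raw float32 samples are written to stdout.
        proc = subprocess.run(
            ["ffmpeg", "-i", self.files[idx],
             "-f", "f32le", "-ac", "1", "-ar", "16000", "pipe:1"],
            stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, check=True,
        )
        audio = np.frombuffer(proc.stdout, dtype=np.float32).copy()
        return torch.from_numpy(audio)
```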

I realized that if I iterate over the DataLoader while my model is on the GPU, data loading becomes 3 to 4 times slower. I see fewer ffmpeg processes in htop and lower CPU utilization overall.

The interesting part is that I don't even enter the training loop: I commented out everything except the batch retrieval loop, and data loading is still slower if the model is on the GPU. I verified with nvidia-smi that no processing is happening on the GPU.
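The stripped-down loop that still shows the slowdown is roughly this (the Linear model and the file list are just stand-ins; only the .cuda() call matters):

```python
import time

import torch
from torch.utils.data import DataLoader

files = ["clip_000.wav", "clip_001.wav"]  # placeholder file list
model = torch.nn.Linear(256, 256)         # stand-in for my model

# With this single line, batch retrieval is 3-4x slower; without it, it is fast.
model = model.cuda()

loader = DataLoader(FfmpegDataset(files), batch_size=1, num_workers=6)

start = time.time()
for i, batch in enumerate(loader):
    pass  # training code is commented out; only batch retrieval remains
print(f"{i + 1} batches in {time.time() - start:.1f}s")
```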

I guess it has something to do with internal locks from CUDA, or with the process spawn / data sharing logic, but at first glance I couldn't find anything.

Is it okay to create a subprocess inside workers that are themselves subprocesses? What happens to DataLoaders when CUDA is involved?

Thanks.