DDP and using GPU in the dataloader

Hi everyone,

I have a training pipeline that works well with DistributedDataParallel, running on a single machine with 8 GPUs. So far I haven’t used torchrun; I’m using DDP directly, spawning one process per GPU from bash and specifying the DDP init method/params manually on the command line.

Now I would like to run a second model, separate from the one I’m training, as part of the dataloader pipeline, using multiple workers, and I would like that model to also use the GPU. This is where the problems start. CUDA isn’t happy about being initialized again in the forked workers and asks me to set mp.set_start_method to ‘spawn’, but when I do that, I’m asked to call freeze_support() and it complains again that the context has already been set.

I’ve tried moving to torchrun, in case some magic would happen, but that fails in similar ways.

I’ll try to create a toy example to share here, but I wonder if there’s something obvious I’m missing here?

A small example showing your issue would be great.

One thing you could try is switching to single-process data loading: torch.utils.data — PyTorch 1.11.0 documentation

@VitalyFedyunin do you have further recommendations here?

Thanks for your quick reply. Using a single process to generate the data works, but it is a lot slower for reasons related to the nature of the data and the pipeline’s requirements.

After partially rewriting the training loop, I’m now getting something out of the pipeline, but processing hangs at some point, and I notice that only one GPU ever gets used, as if only one worker process is effectively running.

debugging continues…

Maybe you can structure your training loop around pipeline parallelism, with the loader-time model as the first stage.