I have an existing training pipeline that works well with DistributedDataParallel, running on a single machine with 8 GPUs. So far I haven’t used torchrun: I use DDP as-is, spawning one process per GPU from bash and specifying the DDP method/params manually on the command line.
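For reference, my current launch looks roughly like this (`train.py` and its flag are placeholders for my actual script; the `echo` makes it a dry run):

```shell
# Manual DDP launch, no torchrun: one process per GPU, rendezvous
# parameters passed explicitly. Drop the echo to actually launch.
MASTER_ADDR=127.0.0.1
MASTER_PORT=29500
WORLD_SIZE=8
for RANK in $(seq 0 $((WORLD_SIZE - 1))); do
  echo RANK=$RANK WORLD_SIZE=$WORLD_SIZE \
       MASTER_ADDR=$MASTER_ADDR MASTER_PORT=$MASTER_PORT \
       python train.py --local-rank "$RANK"
done
```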
Now I would like to use another model, different from the one I’m training, as part of the dataloader pipeline with multiple workers, and I would like that model to use the GPU as well. This is where the problems start. CUDA isn’t happy to be re-initialized in a forked worker and asks me to set the start method to ‘spawn’ via mp.set_start_method, but when I do that, I get asked to call ‘freeze_support()’ and it complains that the context has already been set.
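To make sure I understand the non-CUDA half of this, the “context has already been set” part reproduces with the stdlib alone: `set_start_method` mutates a one-shot global, while `get_context('spawn')` returns an independent context object (which is the kind of thing DataLoader’s `multiprocessing_context` argument accepts), so I suspect my global call is colliding with something that already picked a method:

```python
import multiprocessing as mp

# set_start_method() configures one process-wide global and can only
# be called once; a second call raises, even with the same method.
mp.set_start_method("spawn")
try:
    mp.set_start_method("spawn")
except RuntimeError as err:
    second_call_error = str(err)  # "context has already been set"

# get_context() returns a separate context object without touching the
# global one, so it composes with code that already chose a method.
ctx = mp.get_context("spawn")
print(second_call_error, ctx.get_start_method())
```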
I’ve tried moving to torchrun, in case some magic would happen, but that fails in similar ways.
I’ll try to create a toy example to share here, but is there something obvious I’m missing?