Should I use the 'spawn' start method for multiprocessing?

I use torch.distributed to train my model.
When I call torch.multiprocessing.set_start_method('spawn'), GPU memory usage increases as num_workers increases.
However, when I don’t call torch.multiprocessing.set_start_method('spawn'), GPU memory usage stays the same regardless of num_workers.

Therefore, should I use spawn to start multiprocessing?
What effect does set_start_method('spawn') have?
Why does increasing num_workers increase GPU memory usage in spawn mode?

When using the GPU, I believe spawn should be used. According to this multiprocessing best practices page, the CUDA runtime does not support fork, so a CUDA context (~500MB) cannot be inherited by a forked child. This could also be why you see the GPU memory footprint grow with more spawned processes: each process that initializes CUDA gets its own dedicated CUDA context.

Curious, can you allocate a different GPU to each different process? Or do they have to use the same GPU in your application?

For distributed parallel training, we do allocate one process per GPU to train the model.

However, the dataloader also uses multiprocessing, so increasing num_workers increases the GPU memory footprint.

In my opinion, the extra GPU memory footprint from increasing num_workers does not help training (it gives neither faster nor better performance).

If those dataloader processes do not use GPU, I guess they can use fork instead?
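A minimal sketch of that idea, using only the standard library: choose the start method for helper processes based on whether they need CUDA. The helper name `worker_context` is hypothetical, and falling back to spawn on platforms without fork is an assumption on my part.

```python
import multiprocessing as mp


def worker_context(workers_need_cuda: bool):
    """Return a multiprocessing context for helper processes.

    Hypothetical helper for illustration, not part of PyTorch.
    """
    # fork is not available on every platform (e.g. Windows).
    fork_available = "fork" in mp.get_all_start_methods()
    if workers_need_cuda or not fork_available:
        # Using CUDA in a subprocess requires spawn (or forkserver).
        return mp.get_context("spawn")
    # CPU-only workers can be forked.
    return mp.get_context("fork")


ctx = worker_context(workers_need_cuda=False)
print(ctx.get_start_method())
```

DataLoader accepts a `multiprocessing_context` argument, so something like `DataLoader(dataset, num_workers=2, multiprocessing_context=ctx)` should let the workers use fork without changing the global start method. Whether this removes the extra memory depends on the workers really never touching CUDA.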

cc @vincentqb for DataLoader questions.

I want to bump this post because I’m having this exact problem right now. Each additional worker my processes spawn for data loading increases GPU memory usage by about 500 MiB.

Does anyone know how to fix this?

The 500MB is about the size of a CUDA context. Do those processes use GPUs?

They shouldn’t.

Specifically, I’m trying to run my code with 2 GPUs, so I spawn two processes. Each initializes its own DataLoader object with num_workers=2. I find that there are 6 processes in total using GPU memory. I definitely expect the two main processes to use GPU memory, but I don’t understand why the dataloader worker processes are also using GPU memory.
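The count of 6 follows from the setup: each of the 2 main processes starts 2 workers of its own. A rough sketch of the arithmetic, assuming every worker creates a CUDA context of about 500 MiB (the figure quoted earlier in this thread; the function names are made up for illustration):

```python
def total_processes(world_size: int, num_workers: int) -> int:
    # One main process per GPU, each with its own DataLoader workers.
    return world_size * (1 + num_workers)


def worker_context_mib(world_size: int, num_workers: int, context_mib: int = 500) -> int:
    # Extra GPU memory if every worker creates its own CUDA context.
    return world_size * num_workers * context_mib


print(total_processes(2, 2))     # 6
print(worker_context_mib(2, 2))  # 2000
```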

If I run my code without DDP, I do not see this issue.

Let’s continue discussions in DistributedDataParallel causes Dataloader workers to utilize GPU memory