I use torch.distributed to train my model. When I use torch.multiprocessing.set_start_method('spawn'), the GPU memory usage increases as num_workers increases. However, when I don't use torch.multiprocessing.set_start_method('spawn'), the GPU memory usage stays the same across different num_workers values.
Therefore, should I use spawn to start multiprocessing? What's the influence of the start method? Why does increasing num_workers increase the GPU memory usage when using spawn?
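The behavior above comes down to how the two start methods create child processes. This is not PyTorch-specific, so it can be sketched with plain multiprocessing (the names here are illustrative, standing in for a CUDA context):

```python
import multiprocessing as mp
import os

# Hypothetical stand-in for a CUDA context: module-level state that is
# expensive to build, created once whenever the module is (re-)imported.
EXPENSIVE_STATE = {"owner_pid": os.getpid()}

def report_owner(queue):
    # 'fork' children inherit the parent's already-built state;
    # 'spawn' children re-import the module and rebuild it themselves.
    queue.put((os.getpid(), EXPENSIVE_STATE["owner_pid"]))

def child_rebuilt_state(start_method):
    ctx = mp.get_context(start_method)
    queue = ctx.Queue()
    proc = ctx.Process(target=report_owner, args=(queue,))
    proc.start()
    child_pid, owner_pid = queue.get()
    proc.join()
    # True  -> the child created its own copy of the state
    # False -> the child inherited the parent's copy
    return child_pid == owner_pid

if __name__ == "__main__":
    # Each spawned process pays the setup cost again (analogous to each
    # worker holding its own CUDA context on the GPU):
    print(child_rebuilt_state("spawn"))  # True
    # Forked processes inherit the parent's state instead (on Linux):
    print(child_rebuilt_state("fork"))   # False
```

This is why per-process overhead that is invisible under fork can show up under spawn: spawned workers rebuild module state from scratch rather than sharing the parent's pages.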
When using GPUs, I believe spawn should be used; according to the multiprocessing best practices page, the CUDA context (~500MB) does not survive a fork. This could also be why you see a growing GPU memory footprint with more spawned processes, as each process holds its own dedicated CUDA context.
Curious, can you allocate a different GPU to each different process? Or do they have to use the same GPU in your application?
For distributed parallel training, we do allocate one process per GPU to train the model. However, since the DataLoader also uses multiprocessing, increasing num_workers increases the GPU memory footprint.
In my opinion, the extra GPU memory consumed by a larger num_workers does not help training (neither faster nor better performance).
If those dataloader processes do not use GPU, I guess they can use fork instead?
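One way to try that is the DataLoader's multiprocessing_context argument, which lets just the loader workers use fork even when the training processes themselves are spawned. A minimal sketch (the toy dataset is illustrative; the key assumption is that workers return CPU tensors and never touch CUDA, since forking after CUDA initialization is only safe if the children stay off the GPU):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy CPU-only dataset; real datasets should likewise return CPU tensors
# so the forked workers never initialize a CUDA context of their own.
dataset = TensorDataset(torch.arange(8, dtype=torch.float32))

loader = DataLoader(
    dataset,
    batch_size=4,
    num_workers=2,
    multiprocessing_context="fork",  # fork only the loader workers
)

# Batches arrive as CPU tensors; move them to the GPU in the main process.
total = sum(batch[0].sum().item() for batch in loader)
print(total)  # 28.0
```

With this setup the workers share the parent's memory pages instead of each paying the ~500MB CUDA-context cost.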
cc @vincentqb for DataLoader questions.
I want to bump this post; I'm having this exact problem right now. Each additional data-loading worker my processes spawn adds about 500 MiB of GPU memory.
Does anyone know how to fix this?
The 500MB is about the size of a CUDA context. Do those processes use GPUs?
Specifically, I'm trying to run my code with 2 GPUs, so I spawn two processes, each initializing its own DataLoader with num_workers=2. I find that 6 processes in total are using GPU memory. I definitely expect the two main processes to use GPU memory, but I don't understand why the DataLoader worker processes are also using it.
If I run my code without DDP, I do not see this issue.
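One way to check whether the loader workers are the ones touching CUDA is a worker_init_fn that asserts no CUDA context exists in the worker (a diagnostic sketch; worker_init_fn runs once inside each worker process when it starts):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def assert_no_cuda(worker_id):
    # If this fires, something imported or constructed inside the worker
    # has already created a CUDA context -- the likely source of the
    # ~500 MiB per-worker GPU footprint.
    assert not torch.cuda.is_initialized(), f"worker {worker_id} touched CUDA"

dataset = TensorDataset(torch.arange(10, dtype=torch.float32))
loader = DataLoader(dataset, batch_size=5, num_workers=2,
                    worker_init_fn=assert_no_cuda)

# Iterate normally; keep all GPU transfers in the main process.
total = sum(batch[0].sum().item() for batch in loader)
print(total)  # 45.0
```

If the assertion fires under DDP but not without it, that would point at module-level CUDA initialization being re-run when the workers are spawned.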