cc @mrshenli in case he has more ideas about how these would work with DDP.
Hi @mrshenli, I made a DDP example and managed to parallelize it with DDP. But I have access to 110 CPUs, not GPUs, and the single-process code is faster than both the 10-process and the 110-process CPU code. Why is that? Is there something I can do to fix this?
Hey @Brando_Miranda @albanD sorry for being late to this discussion.
Are those CPUs/machines or CPU cores? IIRC, PyTorch operators already parallelize across multiple CPU cores (@albanD please correct me if I'm wrong).
If it’s multiple cores on the same machine, how did you make sure that each DDP process operates exclusively on its own set of CPU cores?
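For future readers, one way to give each process its own cores could look like the sketch below. This is an assumption about how one might do it, not something from the original poster's code: `cores_for_rank` and `pin_current_process` are hypothetical helper names, and the pinning relies on `os.sched_setaffinity`, which is only available on Linux.

```python
import os


def cores_for_rank(rank: int, world_size: int, cores: list[int]) -> list[int]:
    """Return the contiguous slice of `cores` assigned to `rank`.

    Leftover cores (when the division is uneven) go to the lowest ranks.
    """
    per_rank, extra = divmod(len(cores), world_size)
    start = rank * per_rank + min(rank, extra)
    stop = start + per_rank + (1 if rank < extra else 0)
    return cores[start:stop]


def pin_current_process(rank: int, world_size: int) -> list[int]:
    """Restrict the calling process to its share of the cores (Linux only)."""
    if not hasattr(os, "sched_setaffinity"):
        return []  # platform without affinity support: do nothing
    available = sorted(os.sched_getaffinity(0))
    mine = cores_for_rank(rank, world_size, available)
    if mine:
        os.sched_setaffinity(0, mine)
    # Assumption: each process would then also limit PyTorch's intra-op
    # thread pool to match, e.g. torch.set_num_threads(len(mine)).
    return mine


if __name__ == "__main__":
    # Show how 10 cores would be split across 4 processes.
    for rank in range(4):
        print(rank, cores_for_rank(rank, 4, list(range(10))))
```

Each DDP process would call `pin_current_process(rank, world_size)` once at startup, before building the model. Without something like this, every process may try to use all cores and they end up competing with each other.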
Another thing is that DDP’s CPU communication overhead might be overshadowing the speedup from parallelizing the compute. The gradient synchronization overhead is roughly constant and independent of the batch size. One helpful exercise might be to increase the batch size and see whether the gap shrinks.
Just wanna confirm when using DDP on 10 CPUs, did you set the per process batch-size to 1/10 compared to local training?
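To make the batch-size argument concrete, here is a toy cost model. It is a rough sketch with made-up numbers, not a measurement: `sec_per_sample` and `comm_sec` are hypothetical constants, compute is assumed to scale linearly with the per-process batch, and the serial run is assumed to use a single core (which real PyTorch CPU code usually does not).

```python
def step_time(global_batch: int, world_size: int,
              sec_per_sample: float, comm_sec: float) -> float:
    """Estimated time of one training step under the toy model."""
    per_process_batch = global_batch / world_size   # each rank sees 1/world_size
    comm = comm_sec if world_size > 1 else 0.0      # no gradient sync when serial
    return per_process_batch * sec_per_sample + comm


def speedup(global_batch: int, world_size: int,
            sec_per_sample: float, comm_sec: float) -> float:
    """Serial step time divided by DDP step time (above 1 means DDP wins)."""
    serial = step_time(global_batch, 1, sec_per_sample, comm_sec)
    parallel = step_time(global_batch, world_size, sec_per_sample, comm_sec)
    return serial / parallel


if __name__ == "__main__":
    # Hypothetical constants: 1 ms of compute per sample, 50 ms per gradient sync.
    for batch in (32, 320, 3200):
        print(batch, round(speedup(batch, 10, 0.001, 0.05), 2))
```

With a small global batch the fixed sync cost dominates and the modelled speedup drops below 1, which matches the "serial is faster" symptom. As the batch grows, the sync cost is amortized and the speedup moves toward the number of processes.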
I assume you have already figured this out. Responding here for future users. This can be done by setting device_ids to None or an empty list. Quote from the API doc:
How is the RPC framework different from what this tutorial shows (Writing Distributed Applications with PyTorch — PyTorch Tutorials 1.7.1 documentation)? That uses send/recv and all_reduce, etc. There are so many options that this is confusing and frustrating.
Hope this can help: PyTorch Distributed Overview — PyTorch Tutorials 2.1.1+cu121 documentation
The first version of RPC is actually built on top of send/recv. To make distributed model parallel easier, it also provides features like distributed autograd (so that you don’t need to manually handle backward gradients using send/recv), Remote Reference (so that you can share a remote object without copying real data), distributed optimizer (so that you don’t need to manually call opt.step() on every participating process), etc.
I was leaving it empty, but yes, None is enough. I can’t remember if I read that comment in the PyTorch code directly or if it was in the docs when I was doing it:
The issue I am facing right now, however, is that even though I have 112 cores (or CPUs, I’m not sure of the difference) available, my serial code is actually faster…?! Not sure if you had that issue or if anyone knows a solution.
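For reference, a minimal sketch of the CPU-only DDP setup being discussed, with device_ids left as None. This is an illustration under assumptions, not the code from my actual experiment: the toy `nn.Linear` model, the `train_step_on_cpu` name, the random data, and the file-based rendezvous are all made up for the example. It uses the gloo backend, which supports CPU tensors.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def train_step_on_cpu(rank: int, world_size: int, init_file: str) -> float:
    """Run one DDP training step on CPU and return the loss value."""
    dist.init_process_group(
        backend="gloo",                      # CPU-capable backend
        init_method=f"file://{init_file}",   # shared file used for rendezvous
        rank=rank,
        world_size=world_size,
    )
    try:
        torch.manual_seed(0)                 # same initial weights on every rank
        model = nn.Linear(4, 1)              # toy model; parameters stay on CPU
        ddp_model = DDP(model, device_ids=None)  # None for CPU modules
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

        inputs = torch.randn(8, 4)           # stand-in for this rank's mini-batch
        targets = torch.randn(8, 1)
        loss = nn.functional.mse_loss(ddp_model(inputs), targets)

        optimizer.zero_grad()
        loss.backward()                      # DDP synchronizes gradients here
        optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    # Single-process smoke run; a real job would start one process per rank.
    init_path = os.path.join(tempfile.mkdtemp(), "ddp_init")
    print(train_step_on_cpu(rank=0, world_size=1, init_file=init_path))
```

In a real run each rank would be a separate process (for example started with `torch.multiprocessing.spawn`) and would load a different shard of the data.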
Wow, that is really useful… wish I had seen this when it was most relevant to me. If I get back to this issue I will check exactly what I have (cores, CPUs, etc.) and report back here. Thanks for your help. Have you been able to speed up your code at all with multiple CPUs? How did you do it?
I recall doing some similar experiments before, and noticed that even when I ran serial PyTorch code, it could still keep all my 24 CPU cores busy. I guess through OMP/parallelized for-loops? @albanD would know more.
But if you have two machines and if communication is not the main bottleneck, I would assume CPU DDP can still help.
The low-level libraries that we use (OpenBLAS, MKL, etc.) all do multithreading under the hood. So yes, for CPU compute, there is no need for multiprocessing.
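A quick way to see this on your own machine is to time the same operation with different intra-op thread counts. The sketch below is only an illustration: `time_matmul` is a hypothetical helper, and the matrix size and repeat count are arbitrary. It uses `torch.get_num_threads` and `torch.set_num_threads` to control the thread pool.

```python
import time

import torch


def time_matmul(num_threads: int, size: int = 512, repeats: int = 5) -> float:
    """Average seconds per matrix multiply using `num_threads` intra-op threads."""
    previous = torch.get_num_threads()
    torch.set_num_threads(num_threads)
    try:
        a = torch.randn(size, size)
        b = torch.randn(size, size)
        torch.mm(a, b)                        # warm-up, not timed
        start = time.perf_counter()
        for _ in range(repeats):
            torch.mm(a, b)
        return (time.perf_counter() - start) / repeats
    finally:
        torch.set_num_threads(previous)       # restore the original setting


if __name__ == "__main__":
    default_threads = torch.get_num_threads()
    print("default intra-op threads:", default_threads)
    print("1 thread:", time_matmul(1))
    print(f"{default_threads} threads:", time_matmul(default_threads))
```

If the single "serial" process already uses many threads, splitting the same machine into many DDP processes mostly adds communication and contention rather than extra compute.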