How to parallelize a loop over the samples of a batch

cc @mrshenli in case he has more ideas about how these would work with DDP.


Hi @mrshenli, I made a DDP example and managed to parallelize it with DDP. However, I have access to 110 CPUs, not GPUs, and the single-process code is faster than both the 10-process and the 110-process CPU runs. Why is that? Is there something I can do to fix this?

Hey @Brando_Miranda @albanD sorry for being late to this discussion.

> I made a DDP example and managed to parallelize it with DDP. But I have access to 110 CPUs not gpus. But the single process code is faster than the 10 and 110 cpu process code. Why is that? Is there something I can do to fix this?

Are those CPUs/machines or CPU cores? IIRC, PyTorch operators already parallelize across multiple CPU cores (@albanD please correct me if I'm wrong).

If it's multiple cores on the same machine, how did you make sure that each DDP process exclusively operates on its own set of CPU cores?
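For example, something like this (a rough, untested sketch: I'm assuming Linux, the gloo backend, and a launcher that sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT) would give each process its own slice of cores:

```python
# Sketch: give each DDP process a disjoint slice of CPU cores so the
# processes don't oversubscribe each other's intra-op threads.
import os
import torch
import torch.distributed as dist

def pin_to_cores(rank: int, world_size: int) -> None:
    n_cores = os.cpu_count()
    cores_per_rank = max(1, n_cores // world_size)
    start = rank * cores_per_rank
    os.sched_setaffinity(0, set(range(start, start + cores_per_rank)))  # Linux only
    torch.set_num_threads(cores_per_rank)  # size of this process's intra-op thread pool

if __name__ == "__main__":
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    pin_to_cores(rank, world_size)
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
```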

Another thing is that DDP's CPU communication overhead might be overshadowing the compute parallelization speedup. The gradient synchronization communication overhead is roughly constant and independent of the batch size. One helpful exercise would be to increase the batch size and see whether the gap shrinks.

Just to confirm: when using DDP on 10 CPUs, did you set the per-process batch size to 1/10 of the one used in local training?
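For reference, a minimal sketch of what I mean (the global batch size of 256 and the random TensorDataset are made-up placeholders):

```python
# Keep the *global* batch size fixed when scaling out with DDP:
# per-process batch size = global batch size / world size.
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

GLOBAL_BATCH_SIZE = 256  # hypothetical value
dataset = TensorDataset(torch.randn(10_000, 32), torch.randn(10_000, 1))

world_size = dist.get_world_size()  # process group must already be initialized
per_process_bs = GLOBAL_BATCH_SIZE // world_size

# DistributedSampler hands each rank a disjoint shard of the data.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=per_process_bs, sampler=sampler)
```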

I assume you have already figured this out. :slight_smile: Responding here for future users. This can be done by setting device_ids to None or an empty list, as described in the device_ids section of the API doc.
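In code, a minimal CPU-only setup looks roughly like this (sketch only; the toy Linear model and the torchrun launch are my own assumptions, not from the docs):

```python
# CPU-only DDP sketch: gloo backend, device_ids left as None.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("gloo")          # reads RANK/WORLD_SIZE/MASTER_* from the env
    model = nn.Linear(32, 1)                 # toy model, for illustration
    ddp_model = DDP(model, device_ids=None)  # None (or []) keeps the module on CPU
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    x, y = torch.randn(16, 32), torch.randn(16, 1)
    loss = nn.functional.mse_loss(ddp_model(x), y)
    loss.backward()                          # gradients are all-reduced across processes here
    opt.step()

if __name__ == "__main__":
    main()  # launch with e.g. `torchrun --nproc_per_node=10 script.py`
```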

How is the RPC framework different from what this tutorial shows (Writing Distributed Applications with PyTorch — PyTorch Tutorials 1.7.1 documentation)? That one uses send/recv, all_reduce, etc. There are so many options that this is confusing and frustrating.

Hope this can help: PyTorch Distributed Overview — PyTorch Tutorials 2.1.1+cu121 documentation

The first version of RPC is actually built on top of send/recv. To make distributed model parallel easier, it also provides features like distributed autograd (so that you don’t need to manually handle backward gradients using send/recv), Remote Reference (so that you can share a remote object without copying real data), distributed optimizer (so that you don’t need to manually call opt.step() on every participating process), etc.
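A rough sketch of how those pieces fit together (the worker names, the toy Linear layer, and the helper function are illustrative assumptions; init_rpc still needs MASTER_ADDR/MASTER_PORT to be set):

```python
# RPC sketch: rank 0 drives training, rank 1 hosts the remote module.
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
from torch.distributed.optim import DistributedOptimizer
from torch.distributed.rpc import RRef

def remote_params(module_rref):
    # Runs on the module's owner; returns RRefs to its parameters.
    return [RRef(p) for p in module_rref.local_value().parameters()]

def run_trainer():
    layer_rref = rpc.remote("worker1", torch.nn.Linear, args=(32, 1))  # module lives on worker1
    param_rrefs = rpc.rpc_sync("worker1", remote_params, args=(layer_rref,))
    opt = DistributedOptimizer(torch.optim.SGD, param_rrefs, lr=0.01)

    x, y = torch.randn(16, 32), torch.randn(16, 1)
    with dist_autograd.context() as ctx_id:
        out = layer_rref.rpc_sync().forward(x)      # forward executes on worker1
        loss = torch.nn.functional.mse_loss(out, y)
        dist_autograd.backward(ctx_id, [loss])      # no manual send/recv of gradients
        opt.step(ctx_id)                            # no manual opt.step() on each process

def main(rank, world_size=2):
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    if rank == 0:
        run_trainer()
    rpc.shutdown()                                  # blocks until all RPC work is done
```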


I was leaving it empty, but yes, None is enough. I can't remember if I read that comment in the PyTorch code directly or if it was in the docs when I was doing it.

The issue I am facing right now, however, is that even though I have 112 cores (or CPUs, I'm not sure of the difference) available, my serial code is actually faster. I'm not sure if you ran into that issue or if anyone knows a solution.

Wow, that is really useful. I wish I had seen this when it was most relevant to me. If I get back to this issue I will check whether I have cores, CPUs, etc. (the exact details) and report back here. Thanks for your help. Have you been able to speed up your code at all with multiple CPUs? How did you do it?

I recall doing some similar experiments before and noticed that even if I run serial PyTorch code, it can still keep all of my 24 CPU cores busy. I guess through OpenMP / parallelized for-loops? @albanD would know more :slight_smile:

But if you have two machines and if communication is not the main bottleneck, I would assume CPU DDP can still help.

Low-level libraries that we use (OpenBLAS, MKL, etc.) all do multithreading under the hood. So yes, for CPU compute there is no need for multiprocessing.
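You can see that intra-op threading directly, without any DDP involved (the `// 4` split at the end is just an illustrative number):

```python
# Inspect (and, if needed, cap) the intra-op CPU thread pool.
import torch

print(torch.get_num_threads())            # size of the intra-op thread pool
print(torch.__config__.parallel_info())   # OpenMP / MKL / ATen threading settings

# If several processes share one machine, capping threads per process avoids
# oversubscription (one possible reason serial code beats CPU DDP):
torch.set_num_threads(max(1, torch.get_num_threads() // 4))  # illustrative split
```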