How do I run inference in parallel?

Hello,
I have 4 GPUs available to me, and I’m trying to run inference utilizing all of them. I’m confused by the many multiprocessing methods out there (e.g. multiprocessing.Pool, torch.multiprocessing, mp.spawn, the launch utility).
I have a model that I trained, but I need to run several hundred thousand crops through it, so it is only practical if I run processes on all the GPUs simultaneously. I would like to assign one copy of the model to each GPU and run a quarter of the data on each. How can I do this?
Thank you in advance.


Since parallel inference does not need any communication among the different processes, I think you can use any of the utilities you mentioned to launch the processes. We can decompose your problem into two subproblems: 1) launching multiple processes to utilize all 4 GPUs; 2) partitioning the input data using a DataLoader.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run_inference(rank, world_size):
    # create the default process group (the address/port only need to agree
    # across the spawned processes)
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # load the trained model and move it to this process's GPU
    model = YourModel()
    model.load_state_dict(torch.load(PATH, map_location="cpu"))
    model.eval()
    model.to(rank)

    # create a dataloader
    dataset = ...
    loader = torch.utils.data.DataLoader(dataset=dataset,
                                         batch_size=batch_size,
                                         shuffle=False,  # no need to shuffle for inference
                                         num_workers=4)

    # iterate over the loaded partition and run the model
    with torch.no_grad():
        for idx, data in enumerate(loader):
            ...

def main():
    world_size = 4
    mp.spawn(run_inference,
        args=(world_size,),
        nprocs=world_size,
        join=True)

if __name__=="__main__":
    main()

Thank you. I will try this out now. I’m assuming that “example” in mp.spawn is the run_inference function?
Also, is it possible to make each GPU run multiple processes or no?

I’m assuming that “example” in mp.spawn is the run_inference function?

Yes, that’s a typo. Fixed now.

Also, is it possible to make each GPU run multiple processes or no?

Running multiple processes on the same GPU will generally be slower, since they compete for the same compute and memory, so it’s not recommended IMO.


I recommend using a custom sampler.
Related thread: DistributedSampler

By default, DistributedSampler divides the dataset by the number of processes (equivalent to the number of GPUs).
In the above thread, I provided an example modification of the sampler to avoid duplication of data.
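
Not the exact modification from the linked thread, but a minimal sketch of the idea (the class name InferenceSampler is made up here): give every rank a non-overlapping slice of the dataset without any padding, so no sample is duplicated across processes.

import torch.distributed as dist
from torch.utils.data import Sampler

class InferenceSampler(Sampler):
    """Give each rank a disjoint slice of the dataset, with no padding,
    so no sample is ever duplicated across processes."""
    def __init__(self, dataset, num_replicas=None, rank=None):
        self.num_replicas = num_replicas if num_replicas is not None else dist.get_world_size()
        self.rank = rank if rank is not None else dist.get_rank()
        # strided split: shard sizes differ by at most one sample
        self.indices = list(range(len(dataset)))[self.rank::self.num_replicas]

    def __iter__(self):
        return iter(self.indices)

    def __len__(self):
        return len(self.indices)

Passing it as the sampler= argument of the DataLoader (and dropping shuffle) makes each process iterate over only its own slice.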


Would the above code stay the same, with the DistributedSampler added, so that each process gets an equal split of different data?

A DistributedSampler with that modification will give you almost equal-sized splits.
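
As a rough sketch, only the DataLoader part of the earlier snippet needs to change (batch_size stays a placeholder; the sampler now decides which indices each rank sees, so shuffle goes away):

from torch.utils.data.distributed import DistributedSampler

# one non-overlapping shard of the dataset per process
sampler = DistributedSampler(dataset,
                             num_replicas=world_size,
                             rank=rank,
                             shuffle=False)
loader = torch.utils.data.DataLoader(dataset=dataset,
                                     batch_size=batch_size,
                                     sampler=sampler,
                                     num_workers=4)

Note that the stock DistributedSampler pads the index list so every rank gets exactly the same number of samples; that padding is the duplication the modified sampler avoids.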

  • I don’t know how you defined your model, but you should also use DDP to maximally parallelize the model across multiple GPUs, and use DistributedSampler with multiple processes.
  • Make sure to customize the sampler so that there is no overlap between the different ranks (processes).
  • You should communicate between the different processes to collect loss or accuracy metrics (see the sketch after this list).
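
For that last point, something along these lines works (a sketch assuming the gloo process group from the earlier snippet; the correct/total counters are hypothetical stand-ins for whatever you accumulate during the loop):

import torch
import torch.distributed as dist

# hypothetical per-rank counters filled in during the inference loop
correct = torch.tensor([0.0])
total = torch.tensor([0.0])
# ... accumulate correct/total over this rank's shard ...

# sum the counters over all ranks (gloo works with CPU tensors),
# after which every rank can compute the global accuracy
dist.all_reduce(correct, op=dist.ReduceOp.SUM)
dist.all_reduce(total, op=dist.ReduceOp.SUM)
accuracy = (correct / total).item()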

You may want to take a look at my GitHub repository for an example.

  • I don’t know how you defined your model, but you should also use DDP to maximally parallelize the model across multiple GPUs, and use DistributedSampler with multiple processes.

Do you mean using DDP for inference in this case?

@wayi
Correction: multiprocessing without DDP can also work if it is limited to inference only.

My preference is to use DDP at inference too, because I don’t want to change my model object, which is already wrapped in DDP at training time.
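
Roughly, the model setup from the earlier snippet then looks like this (just a sketch; it assumes the process group is already initialized, and YourModel, PATH, and loader are the same placeholders as before):

from torch.nn.parallel import DistributedDataParallel as DDP

# build and load the model exactly as during training, then wrap it in DDP
model = YourModel()
model.load_state_dict(torch.load(PATH, map_location="cpu"))
model.to(rank)
ddp_model = DDP(model, device_ids=[rank])
ddp_model.eval()

with torch.no_grad():
    for data in loader:
        output = ddp_model(data.to(rank))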


There’s no communication between processes during inference, so I don’t think you need gloo here. You can just run n processes with different CUDA_VISIBLE_DEVICES.
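
A rough sketch of that approach, with no process group at all; infer.py and its --shard/--num-shards flags are hypothetical stand-ins for a script that processes one quarter of the crops and writes its results out on its own:

import os
import subprocess

procs = []
for gpu in range(4):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu)   # this process only sees one GPU
    procs.append(subprocess.Popen(
        ["python", "infer.py", "--shard", str(gpu), "--num-shards", "4"],
        env=env))

# wait for all four inference processes to finish
for p in procs:
    p.wait()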


I am wondering: what would the difference be if I added the DistributedSampler to the dataloader here?