How do I run inference in parallel?

Hello,
I have 4 GPUs available to me, and I’m trying to run inference utilizing all of them. I’m confused by the many multiprocessing methods out there (e.g. multiprocessing.Pool, torch.multiprocessing, mp.spawn, the launch utility).
I have a model that I trained, but I need to run several hundred thousand crops through it, so it is only practical if I run processes on all the GPUs simultaneously. I would like to assign one copy of the model to each GPU and run a quarter of the data on each. How can I do this?
Thank you in advance.


Since parallel inference does not need any communication among the different processes, I think you can use any of the utilities you mentioned to launch the processes. We can decompose your problem into two subproblems: 1) launching multiple processes to utilize all 4 GPUs; 2) partitioning the input data using a DataLoader.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run_inference(rank, world_size):
    # create the default process group (the address/port only need to agree
    # across the spawned processes)
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # load the trained model and move it to this process's GPU
    model = YourModel()
    model.load_state_dict(torch.load(PATH, map_location="cpu"))
    model.eval()
    model.to(rank)

    # create a dataloader
    dataset = ...
    loader = torch.utils.data.DataLoader(dataset=dataset,
                                         batch_size=batch_size,
                                         shuffle=False,  # no need to shuffle for inference
                                         num_workers=4)

    # iterate over the loaded partition and run the model
    with torch.no_grad():
        for idx, data in enumerate(loader):
            ...

def main():
    world_size = 4
    mp.spawn(run_inference,
        args=(world_size,),
        nprocs=world_size,
        join=True)

if __name__=="__main__":
    main()

Thank you. I will try this out now. I’m assuming that “example” in mp.spawn is the run_inference function?
Also, is it possible to make each GPU run multiple processes or no?

I’m assuming that “example” in mp.spawn is the run_inference function?

Yes, that’s a typo. Fixed now.

Also, is it possible to make each GPU run multiple processes or no?

Running multiple processes on the same GPU will generally be slower, since they compete for the same compute and memory, so it’s not recommended IMO.


I recommend using a custom sampler.
Related thread: DistributedSampler

By default, DistributedSampler divides the dataset by the number of processes (equivalent to the number of GPUs).
In the above thread, I provided an example modification of the sampler to avoid duplication of data.
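
Not the exact modification from the linked thread, but a minimal sketch of the idea (the class name InferenceSampler is made up here): give every rank a non-overlapping slice of the dataset without any padding, so no sample is duplicated across processes.

import torch.distributed as dist
from torch.utils.data import Sampler

class InferenceSampler(Sampler):
    """Give each rank a disjoint slice of the dataset, with no padding,
    so no sample is ever duplicated across processes."""
    def __init__(self, dataset, num_replicas=None, rank=None):
        self.num_replicas = num_replicas if num_replicas is not None else dist.get_world_size()
        self.rank = rank if rank is not None else dist.get_rank()
        # strided split: shard sizes differ by at most one sample
        self.indices = list(range(len(dataset)))[self.rank::self.num_replicas]

    def __iter__(self):
        return iter(self.indices)

    def __len__(self):
        return len(self.indices)

Passing it as the sampler= argument of the DataLoader (and dropping shuffle) makes each process iterate over only its own slice.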


Would the above code stay the same, with the DistributedSampler added, so that each process gets an equal split of different data?

A DistributedSampler with that modification will give you almost equal-sized splits.
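
As a rough sketch, only the DataLoader part of the earlier snippet needs to change (batch_size stays a placeholder; the sampler now decides which indices each rank sees, so shuffle goes away):

from torch.utils.data.distributed import DistributedSampler

# one non-overlapping shard of the dataset per process
sampler = DistributedSampler(dataset,
                             num_replicas=world_size,
                             rank=rank,
                             shuffle=False)
loader = torch.utils.data.DataLoader(dataset=dataset,
                                     batch_size=batch_size,
                                     sampler=sampler,
                                     num_workers=4)

Note that the stock DistributedSampler pads the index list so every rank gets exactly the same number of samples; that padding is the duplication the modified sampler avoids.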

  • I don’t know how you defined your model, but you should also use DDP to maximally parallelize the model across multiple GPUs, and use DistributedSampler with multiple processes.
  • Make sure to customize the sampler so that there is no overlap between the different ranks (processes).
  • You should communicate between the different processes to collect loss or accuracy metrics (see the sketch after this list).
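
For that last point, something along these lines works (a sketch assuming the gloo process group from the earlier snippet; the correct/total counters are hypothetical stand-ins for whatever you accumulate during the loop):

import torch
import torch.distributed as dist

# hypothetical per-rank counters filled in during the inference loop
correct = torch.tensor([0.0])
total = torch.tensor([0.0])
# ... accumulate correct/total over this rank's shard ...

# sum the counters over all ranks (gloo works with CPU tensors),
# after which every rank can compute the global accuracy
dist.all_reduce(correct, op=dist.ReduceOp.SUM)
dist.all_reduce(total, op=dist.ReduceOp.SUM)
accuracy = (correct / total).item()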

You may want to take a look at my GitHub repository for an example.

  • I don’t know how you defined your model, but you should also use DDP to maximally parallelize the model across multiple GPUs, and use DistributedSampler with multiple processes.

Do you mean using DDP for inference in this case?

@wayi
Correction: multiprocessing without DDP can also work if it is limited to inference only.

My preference is to use DDP at inference too, because I don’t want to change my model object, which is already wrapped in DDP at training time.
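
Roughly, the model setup from the earlier snippet then looks like this (just a sketch; it assumes the process group is already initialized, and YourModel, PATH, and loader are the same placeholders as before):

from torch.nn.parallel import DistributedDataParallel as DDP

# build and load the model exactly as during training, then wrap it in DDP
model = YourModel()
model.load_state_dict(torch.load(PATH, map_location="cpu"))
model.to(rank)
ddp_model = DDP(model, device_ids=[rank])
ddp_model.eval()

with torch.no_grad():
    for data in loader:
        output = ddp_model(data.to(rank))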


There’s no communication between processes during inference, so I don’t think you need gloo here. You can just run n processes with different CUDA_VISIBLE_DEVICES.
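
A rough sketch of that approach, with no process group at all; infer.py and its --shard/--num-shards flags are hypothetical stand-ins for a script that processes one quarter of the crops and writes its results out on its own:

import os
import subprocess

procs = []
for gpu in range(4):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu)   # this process only sees one GPU
    procs.append(subprocess.Popen(
        ["python", "infer.py", "--shard", str(gpu), "--num-shards", "4"],
        env=env))

# wait for all four inference processes to finish
for p in procs:
    p.wait()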


I am wondering: what would the difference be if I added the DistributedSampler to the dataloader here?