One model per GPU for inference, then merge the four results together

I have four GPUs. I wrote one program that loads four models, runs inference with each model on shared data, each model on its own GPU, and then merges the models' results. For example:

models = [model1.to('cuda:0'), model2.to('cuda:1'),
          model3.to('cuda:2'), model4.to('cuda:3')]  # one model per GPU
data = get_data()
results = []
for index, model in enumerate(models):
    device_data = data.to('cuda:%d' % index)  # copy the shared data to this GPU
    r = model(device_data)
    results.append(r)
final_result = fusion_result(results)

But right now each model can only get its data from a single process. Can I fetch the data once, share it with multiple models, and run the inference asynchronously with multiprocessing? Many thanks.

Since the models and data tensors are already pushed to different devices, each execution is already performed asynchronously, which should also be visible in e.g. nvprof or Nsight.

In my code the GPU only runs inference when model(data) is called, which means that if one model is very complex, the other three GPUs will sit idle until that GPU finishes its inference. I am wondering whether those inferences can run at the same time on the shared data? Thank you so much.

model(data) will execute the forward pass on the used device inside the model.
If your model implementation is only using a single device (i.e. no model sharding is used), the execution will begin asynchronously.
The next model(data) call will launch the execution on the next specified device (assuming you've created copies of the model and data on the other devices).
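
For illustration only, here is a minimal sketch of that behaviour from the host side (my own sketch, not code from this thread): each forward pass is launched on its device and the host only blocks when the devices are synchronized before fusing. model1..model4, get_data, and fusion_result are the placeholders from the question above.

import torch

models = [model1.to('cuda:0'), model2.to('cuda:1'),
          model3.to('cuda:2'), model4.to('cuda:3')]  # placeholders from the question
data = get_data()

results = []
with torch.no_grad():
    for index, model in enumerate(models):
        # non_blocking copies can overlap with compute when `data` is in pinned memory
        device_data = data.to('cuda:%d' % index, non_blocking=True)
        results.append(model(device_data))  # the launch returns before the GPU finishes

# At this point all GPUs are working concurrently; wait for them before fusing.
for index in range(len(models)):
    torch.cuda.synchronize(index)
final_result = fusion_result(results)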

Yes, but the remaining devices sit idle until the next model(data) call, which wastes a lot of time. What I want to do is send the shared data to the different devices at the same time, have each device execute its model(data) forward pass asynchronously, and then merge the results in the host process. Any idea how to do that? Thank you so much for your reply.

You can use SimpleQueue in torch.multiprocessing to do that. E.g., you can create a queue between the host process and each subprocess, and use the queue to pass input data and collect output. The test below can serve as an example:
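
As a rough sketch of that pattern (this is not the linked test): one worker process per GPU, with a SimpleQueue pair for input and output. model1..model4, get_data, and fusion_result are again the placeholders from the question.

import torch
import torch.multiprocessing as mp

def worker(device, model, in_q, out_q):
    model = model.to(device).eval()
    while True:
        data = in_q.get()
        if data is None:              # sentinel to shut the worker down
            break
        with torch.no_grad():
            out_q.put(model(data.to(device)).cpu())

if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)
    models = [model1, model2, model3, model4]       # placeholders
    workers = []
    for i, m in enumerate(models):
        in_q, out_q = mp.SimpleQueue(), mp.SimpleQueue()
        p = mp.Process(target=worker, args=('cuda:%d' % i, m, in_q, out_q))
        p.start()
        workers.append((p, in_q, out_q))

    data = get_data()                               # placeholder
    for _, in_q, _ in workers:                      # broadcast the shared input
        in_q.put(data)
    results = [out_q.get() for _, _, out_q in workers]
    final_result = fusion_result(results)           # placeholder

    for _, in_q, _ in workers:                      # stop the workers
        in_q.put(None)
    for p, _, _ in workers:
        p.join()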

Thank you so much. I found that pickling and unpickling take a lot of time when transferring the numpy image to the subprocess with queue.put(img) and getting the result back with queue.get(). It is less efficient than the loop in a single process. Do you have any idea how to solve this problem? Many thanks.

Does using a shared_memory tensor help in this case? See the doc below:
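
For example (a minimal sketch, not the linked doc, assuming the input is the 416x416x3 numpy image mentioned later in the thread), converting the image to a tensor whose storage lives in shared memory lets torch.multiprocessing pass it between processes by handle instead of pickling the pixel data:

import numpy as np
import torch

img = np.zeros((416, 416, 3), dtype=np.uint8)        # stand-in for the real image
shared_img = torch.from_numpy(img).share_memory_()   # storage moved to shared memory

# Putting `shared_img` into a torch.multiprocessing queue now only sends a handle
# to the shared storage, so the 416x416x3 payload is not re-serialized each time.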

@kehuantiantang can you please clarify the nature of the data: is it pinned memory on the CPU, or is it on a GPU? It might be more efficient to put it on one of the GPU devices and share it with the others via IPC instead of transferring it from device to device with the .to() call.

Also, it seems that you are looking for a solution to evenly distribute the load between the GPUs, since one of the models is much faster than the others. In this case I recommend trying an outer loop that rotates the models between GPUs (if memory allows) and avoiding fusion_result calls for as long as possible, since that looks like your synchronization point.
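
One possible reading of that suggestion, as a rough sketch (get_batches, model1..model4, and fusion_result are hypothetical placeholders): rotate which model sits on which GPU across iterations, collect the raw outputs, and call fusion_result only once at the end.

import torch

models = [model1, model2, model3, model4]             # placeholders
per_batch_outputs = []

for step, batch in enumerate(get_batches()):          # hypothetical data source
    outputs = []
    for i, model in enumerate(models):
        # rotate the model/GPU assignment each step (only worth it if the models
        # fit in memory and the transfer cost is small compared to inference)
        device = 'cuda:%d' % ((i + step) % len(models))
        model.to(device)
        outputs.append(model(batch.to(device)))
    per_batch_outputs.append(outputs)                 # defer the fusion

# fusion_result is the synchronization point, so call it as late as possible.
final_results = [fusion_result(o) for o in per_batch_outputs]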

Sorry, I did not explain it clearly. My data is a numpy image with shape 416x416x3. Because my models are so large that they cannot all be loaded onto one GPU at the same time, I use model1.to('cuda:0'), data.to('cuda:0') and rotate the models between the GPUs with a for loop, like this:


All the models run in a single process, and only one GPU does inference at a time (the others sit idle).

Yes, I tried this one, delivering the data to each subprocess through a queue. I found that the queue needs pickle and unpickle operations, but my data is a numpy image with shape 416x416x3, and the pickling and unpickling take a lot of time. It is less efficient than the for loop, so I gave up. Thanks a lot.