One model per GPU for inference, then merge the four results together

I have four GPUs. I wrote one program that loads four models, runs inference with each model on shared data, each model on its own GPU, and then merges the models' results. For example:

models = [model1.to('cuda:0'), model2.to('cuda:1'),
          model3.to('cuda:2'), model4.to('cuda:3')]  # one model per GPU
data = get_data()
results = []
for index, model in enumerate(models):
    device_data = data.to('cuda:%d' % index)  # copy the shared data to this GPU
    r = model(device_data)
    results.append(r)
final_result = fusion_result(results)

But right now each model can only get its data from a single process. Can I fetch the data once, share it with multiple models, and run the inference asynchronously with multiprocessing? Many thanks.

Since the models and data tensors are already pushed to different devices, each execution is already performed asynchronously, which should also be visible in e.g. nvprof or Nsight.

In my code the GPU only runs inference when model(data) is called, which means that if one model is very complex, the other three GPUs will sit idle until that GPU finishes its inference. I am wondering whether those inferences can run at the same time on the shared data? Thank you so much.

model(data) will execute the forward pass on the used device inside the model.
If your model implementation is only using a single device (i.e. no model sharding is used), the execution will begin asynchronously.
The next model(data) call will launch the execution on the next specified device (assuming you've created copies of the model and data on the other devices).
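
For illustration only, here is a minimal sketch of that behaviour from the host side (my own sketch, not code from this thread): each forward pass is launched on its device and the host only blocks when the devices are synchronized before fusing. model1..model4, get_data, and fusion_result are the placeholders from the question above.

import torch

models = [model1.to('cuda:0'), model2.to('cuda:1'),
          model3.to('cuda:2'), model4.to('cuda:3')]  # placeholders from the question
data = get_data()

results = []
with torch.no_grad():
    for index, model in enumerate(models):
        # non_blocking copies can overlap with compute when `data` is in pinned memory
        device_data = data.to('cuda:%d' % index, non_blocking=True)
        results.append(model(device_data))  # the launch returns before the GPU finishes

# At this point all GPUs are working concurrently; wait for them before fusing.
for index in range(len(models)):
    torch.cuda.synchronize(index)
final_result = fusion_result(results)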

Yes, but the remaining devices sit idle until the next model(data) call, which wastes a lot of time. What I want to do is send the shared data to the different devices at the same time, have each device execute its model(data) forward pass asynchronously, and then merge the results in the host process. Any idea how to do that? Thank you so much for your reply.

You can use SimpleQueue in torch.multiprocessing to do that. E.g., you can create a queue between the host process and each subprocess, and use the queue to pass input data and collect output. The test below can serve as an example:
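
As a rough sketch of that pattern (this is not the linked test): one worker process per GPU, with a SimpleQueue pair for input and output. model1..model4, get_data, and fusion_result are again the placeholders from the question.

import torch
import torch.multiprocessing as mp

def worker(device, model, in_q, out_q):
    model = model.to(device).eval()
    while True:
        data = in_q.get()
        if data is None:              # sentinel to shut the worker down
            break
        with torch.no_grad():
            out_q.put(model(data.to(device)).cpu())

if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)
    models = [model1, model2, model3, model4]       # placeholders
    workers = []
    for i, m in enumerate(models):
        in_q, out_q = mp.SimpleQueue(), mp.SimpleQueue()
        p = mp.Process(target=worker, args=('cuda:%d' % i, m, in_q, out_q))
        p.start()
        workers.append((p, in_q, out_q))

    data = get_data()                               # placeholder
    for _, in_q, _ in workers:                      # broadcast the shared input
        in_q.put(data)
    results = [out_q.get() for _, _, out_q in workers]
    final_result = fusion_result(results)           # placeholder

    for _, in_q, _ in workers:                      # stop the workers
        in_q.put(None)
    for p, _, _ in workers:
        p.join()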

Thank you so much. I found that pickling and unpickling take a lot of time when transferring the numpy image to the subprocess with queue.put(img) and getting the result back with queue.get(). It is less efficient than the loop in a single process. Do you have any idea how to solve this problem? Many thanks.

Does using a shared_memory tensor help in this case? See the doc below:
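
For example (a minimal sketch, not the linked doc, assuming the input is the 416x416x3 numpy image mentioned later in the thread), converting the image to a tensor whose storage lives in shared memory lets torch.multiprocessing pass it between processes by handle instead of pickling the pixel data:

import numpy as np
import torch

img = np.zeros((416, 416, 3), dtype=np.uint8)        # stand-in for the real image
shared_img = torch.from_numpy(img).share_memory_()   # storage moved to shared memory

# Putting `shared_img` into a torch.multiprocessing queue now only sends a handle
# to the shared storage, so the 416x416x3 payload is not re-serialized each time.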

@kehuantiantang can you please clarify the nature of the data: is it pinned memory on the CPU, or is it on a GPU? It might be more efficient to put it on one of the GPU devices and share it with the others via IPC instead of transferring it from device to device with the .to() call.

Also, it seems that you are looking for a solution to evenly distribute the load between the GPUs, since one of the models is much faster than the others. In this case I recommend trying an outer loop that rotates the models between GPUs (if memory allows) and avoiding fusion_result calls for as long as possible, since that looks like your synchronization point.
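
One possible reading of that suggestion, as a rough sketch (get_batches, model1..model4, and fusion_result are hypothetical placeholders): rotate which model sits on which GPU across iterations, collect the raw outputs, and call fusion_result only once at the end.

import torch

models = [model1, model2, model3, model4]             # placeholders
per_batch_outputs = []

for step, batch in enumerate(get_batches()):          # hypothetical data source
    outputs = []
    for i, model in enumerate(models):
        # rotate the model/GPU assignment each step (only worth it if the models
        # fit in memory and the transfer cost is small compared to inference)
        device = 'cuda:%d' % ((i + step) % len(models))
        model.to(device)
        outputs.append(model(batch.to(device)))
    per_batch_outputs.append(outputs)                 # defer the fusion

# fusion_result is the synchronization point, so call it as late as possible.
final_results = [fusion_result(o) for o in per_batch_outputs]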

Sorry, I did not explain it clearly. My data is a numpy image with shape 416x416x3. Because my models are so large that they cannot all be loaded onto one GPU at the same time, I use model1.to('cuda:0'), data.to('cuda:0') and rotate the models between the GPUs with a for loop, like this:


All the models run in a single process, and only one GPU does inference at a time (the others sit idle).

Yes, I tried this one, delivering the data to each subprocess through a queue. I found that the queue needs pickle and unpickle operations, but my data is a numpy image with shape 416x416x3, and the pickling and unpickling take a lot of time. It is less efficient than the for loop, so I gave up. Thanks a lot.