How to run inference asynchronously

Is there any support for asynchronous inference?
Since inference on the GPU also blocks the CPU, I would like to be able to process some CPU tasks while waiting.

By default, CUDA kernels are launched asynchronously (you need to call torch.cuda.synchronize() to block until all launched kernels are done).
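
To illustrate, here is a rough timing sketch (assuming a CUDA device is available): the kernel launches return control to the CPU almost immediately, and only torch.cuda.synchronize() blocks until the queued work has finished.

import time
import torch

x = torch.randn(4096, 4096, device='cuda')

start = time.perf_counter()
for _ in range(50):
    torch.matmul(x, x)          # launches return almost immediately
launch_time = time.perf_counter() - start

torch.cuda.synchronize()        # blocks until all queued kernels have finished
total_time = time.perf_counter() - start

print(f'launch: {launch_time:.4f}s  total: {total_time:.4f}s')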

I know that, but my code doesn't behave asynchronously:

import torch
from torchvision import models

data = torch.randn(128, 16, 3, 224, 224)   # 128 batches of 16 images
data = data.to('cuda')

model = models.squeezenet1_1(pretrained=True)
model.to('cuda')

for i in range(100):
    model(data[i])   # one forward pass per batch

What I expected vs. what I see:

  • The loop should return almost immediately, but it actually takes a long time
  • Adding torch.cuda.synchronize() afterwards costs no extra time, which suggests the loop is already blocking (a timing sketch is below)
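
For reference, this is roughly how those two observations could be measured against the code above (same model and data as in the snippet, with timing added around the loop):

import time
import torch
from torchvision import models

data = torch.randn(128, 16, 3, 224, 224, device='cuda')
model = models.squeezenet1_1(pretrained=True).to('cuda')

start = time.perf_counter()
for i in range(100):
    model(data[i])              # expected to return immediately if launches were async
loop_time = time.perf_counter() - start

torch.cuda.synchronize()        # this is where the waiting should happen
total_time = time.perf_counter() - start

print(f'loop: {loop_time:.2f}s  after sync: {total_time:.2f}s')
# observed: loop_time is already large and the sync adds almost nothing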

How many GPUs are there in this setup?

As a follow-up, it might be useful to investigate multiprocessing in this setup, even if it is admittedly clunky, as a way to avoid unexpected GIL interactions.
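
For example, here is a minimal sketch of that idea, with hypothetical helpers gpu_worker and cpu_work: inference runs in a child process while the main process keeps the CPU busy. The 'spawn' start method is required to use CUDA in a subprocess.

import torch
import torch.multiprocessing as mp
from torchvision import models

def gpu_worker(queue):
    # hypothetical worker: runs inference on the GPU in its own process
    model = models.squeezenet1_1(pretrained=True).to('cuda').eval()
    with torch.no_grad():
        for _ in range(100):
            batch = torch.randn(16, 3, 224, 224, device='cuda')  # stand-in for real data
            model(batch)
    queue.put('done')

def cpu_work():
    # stand-in for the CPU-side tasks you want to overlap with inference
    return sum(i * i for i in range(10_000_000))

if __name__ == '__main__':
    mp.set_start_method('spawn')          # required when using CUDA in child processes
    queue = mp.Queue()
    p = mp.Process(target=gpu_worker, args=(queue,))
    p.start()
    result = cpu_work()                   # main process keeps the CPU busy meanwhile
    print(queue.get(), result)
    p.join()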