How to run inference asynchronously

Is there any support for asynchronous inference?
Since inference on the GPU also blocks the CPU, I would like to be able to process some CPU tasks while waiting.

By default, CUDA kernels are launched asynchronously (you need to call torch.cuda.synchronize() to block until all launched kernels are done).
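
To illustrate, here is a rough timing sketch (assuming a CUDA device is available): the kernel launches return control to the CPU almost immediately, and only torch.cuda.synchronize() blocks until the queued work has finished.

import time
import torch

x = torch.randn(4096, 4096, device='cuda')

start = time.perf_counter()
for _ in range(50):
    torch.matmul(x, x)          # launches return almost immediately
launch_time = time.perf_counter() - start

torch.cuda.synchronize()        # blocks until all queued kernels have finished
total_time = time.perf_counter() - start

print(f'launch: {launch_time:.4f}s  total: {total_time:.4f}s')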

I know that, but my code doesn't behave asynchronously:

import torch
from torchvision import models

data = torch.randn(128, 16, 3, 224, 224)   # 128 batches of 16 images
data = data.to('cuda')

model = models.squeezenet1_1(pretrained=True)
model.to('cuda')

for i in range(100):
    model(data[i])   # one forward pass per batch

What I expected vs. what I see:

  • The loop should return almost immediately, but it actually takes a long time
  • Adding torch.cuda.synchronize() afterwards costs no extra time, which suggests the loop is already blocking (a timing sketch is below)
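
For reference, this is roughly how those two observations could be measured against the code above (same model and data as in the snippet, with timing added around the loop):

import time
import torch
from torchvision import models

data = torch.randn(128, 16, 3, 224, 224, device='cuda')
model = models.squeezenet1_1(pretrained=True).to('cuda')

start = time.perf_counter()
for i in range(100):
    model(data[i])              # expected to return immediately if launches were async
loop_time = time.perf_counter() - start

torch.cuda.synchronize()        # this is where the waiting should happen
total_time = time.perf_counter() - start

print(f'loop: {loop_time:.2f}s  after sync: {total_time:.2f}s')
# observed: loop_time is already large and the sync adds almost nothing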

How many GPUs are there in this setup?

As a follow-up, it might be useful to investigate multiprocessing in this setup, even if it is admittedly clunky, as a way to avoid unexpected GIL interactions.
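
For example, here is a minimal sketch of that idea, with hypothetical helpers gpu_worker and cpu_work: inference runs in a child process while the main process keeps the CPU busy. The 'spawn' start method is required to use CUDA in a subprocess.

import torch
import torch.multiprocessing as mp
from torchvision import models

def gpu_worker(queue):
    # hypothetical worker: runs inference on the GPU in its own process
    model = models.squeezenet1_1(pretrained=True).to('cuda').eval()
    with torch.no_grad():
        for _ in range(100):
            batch = torch.randn(16, 3, 224, 224, device='cuda')  # stand-in for real data
            model(batch)
    queue.put('done')

def cpu_work():
    # stand-in for the CPU-side tasks you want to overlap with inference
    return sum(i * i for i in range(10_000_000))

if __name__ == '__main__':
    mp.set_start_method('spawn')          # required when using CUDA in child processes
    queue = mp.Queue()
    p = mp.Process(target=gpu_worker, args=(queue,))
    p.start()
    result = cpu_work()                   # main process keeps the CPU busy meanwhile
    print(queue.get(), result)
    p.join()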