Is there any support for asynchronous inference?
Since inference on GPU will also block the CPU, I hope I can process some CPU tasks while waiting.
By default, CUDA kernels run asynchronously; you need to call torch.cuda.synchronize() to block until all launched kernels are done.
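To see this asynchronous behavior directly, here is a minimal timing sketch (assuming PyTorch is installed; it falls back to CPU, where ops are synchronous, so it still runs without a GPU):

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)

t0 = time.perf_counter()
for _ in range(10):
    x = x @ x  # on CUDA this only enqueues a kernel and returns right away
launch_time = time.perf_counter() - t0

if device == "cuda":
    torch.cuda.synchronize()  # now actually wait for the queued matmuls
total_time = time.perf_counter() - t0

print(f"launch: {launch_time:.4f}s, total after sync: {total_time:.4f}s")
```

On a GPU, `launch_time` is typically much smaller than `total_time`, which is the asynchrony in question; on CPU the two are nearly identical.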
I know this, but my code doesn't behave asynchronously:
import torch
from torchvision import models

data = torch.randn(128, 16, 3, 224, 224)
data = data.to('cuda')

model = models.squeezenet1_1(pretrained=True)
model.to('cuda')

for i in range(100):
    model(data[i])
What I expected vs. what actually happens:
- The loop should return almost immediately, but it takes a long time
- Adding torch.cuda.synchronize() after the loop doesn't cost any extra time
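For reference, the overlap being asked for would look something like the sketch below: enqueue GPU work, do unrelated CPU work, then synchronize only when the result is needed (assumes PyTorch; falls back to CPU, where the overlap is a no-op, so the script still runs without a GPU):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(2048, 2048, device=device)

# Enqueue GPU work; on CUDA this call returns before the kernel finishes.
y = x @ x

# Do unrelated CPU work while the GPU churns through its queue.
cpu_result = sum(i * i for i in range(100_000))

if device == "cuda":
    torch.cuda.synchronize()  # block only when y must actually be ready
print(y.shape, cpu_result)
```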
How many GPUs are there in this setup?
As a follow-up, it might be useful to investigate multiprocessing in this setup, even if it is admittedly clunky, as a way to avoid unexpected GIL interactions.