Running two inferences in parallel on the same GPU is slower

yj_z · November 28, 2018, 4:06pm

I run the same code on one GPU. With a single process, inference takes about 5 ms per image and GPU utilization (gpu-util) is about 50%. When I start a second copy of the same program, both processes take about 10 ms per image, and gpu-util is still about 50%. Why does this happen, and how can I fix it?

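import os
os.environ["CUDA_VISIBLE_DEVICES"] = '0'  # set before CUDA is initialized

import time

import numpy as np
import torch
import torchvision.models as models

resnet18 = models.resnet18()
# The 100x100 input is smaller than 224x224; adaptive pooling keeps the FC input size fixed
resnet18.avgpool = torch.nn.AdaptiveAvgPool2d(1)
resnet18 = resnet18.cuda().eval()  # inference mode (BatchNorm uses running statistics)

im = torch.from_numpy(np.random.rand(1, 3, 100, 100).astype(np.float32)).cuda()

with torch.no_grad():  # no autograd graph is needed for inference
    for i in range(10000):
        torch.cuda.synchronize()  # make sure earlier GPU work has finished
        t1 = time.time()
        pred = resnet18(im)
        torch.cuda.synchronize()  # CUDA kernels run asynchronously; wait for them before stopping the clock
        t2 = time.time()
        print(t2 - t1)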

You can use this code to reproduce the problem. First, run the script on its own: it takes about 5 ms per image. Then run the same script in two processes on the same GPU: each process now takes about 10 ms per image.
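As a sanity check on the reported figures, here is a small sketch in plain Python; the helper names `aggregate_throughput` and `serialized_latency` are hypothetical, made up for illustration. It shows that total throughput does not change: one process at 5 ms per image gives 200 images/s, and two processes at 10 ms each also give 200 images/s combined. The second helper encodes an assumption, not a diagnosis: if the GPU runs the two processes' work one at a time (by default, without NVIDIA MPS, work from different processes' CUDA contexts is time-sliced rather than executed concurrently), each process's latency grows with the number of processes.

```python
def aggregate_throughput(ms_per_image: float, num_processes: int) -> float:
    """Images per second summed over all processes, given each process's per-image latency."""
    per_process = 1000.0 / ms_per_image  # images/s achieved by one process
    return per_process * num_processes


def serialized_latency(gpu_ms_per_image: float, num_processes: int) -> float:
    """Per-process latency under a simple, hypothetical model.

    Assumption: the GPU executes only one process's work at a time
    (time-slicing), so in each round a process waits for every other
    process's image as well as its own.
    """
    return gpu_ms_per_image * num_processes


# Figures reported in the post.
single = aggregate_throughput(ms_per_image=5.0, num_processes=1)
double = aggregate_throughput(ms_per_image=10.0, num_processes=2)

print(f"1 process  : {single:.0f} images/s")
print(f"2 processes: {double:.0f} images/s in total")
print(f"modelled latency with 2 processes: {serialized_latency(5.0, 2):.0f} ms per image")
```

Under that assumption, a second process only splits the GPU's time between the two programs; it does not add throughput.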