Hi, I want to run two different networks that take the same input in parallel on multiple GPUs. The main purpose is to speed up inference. So I did:
import time
import torch

# model1 and model2 are defined elsewhere; each gets its own GPU and a copy of the input.
inputs = torch.randn(1, 3, 256, 256)
inputs1 = inputs.to('cuda:0')
inputs2 = inputs.to('cuda:1')
model1 = model1.to('cuda:0')
model2 = model2.to('cuda:1')

# One stream per device so the two forward passes can be queued independently.
s1 = torch.cuda.Stream(device='cuda:0')
s2 = torch.cuda.Stream(device='cuda:1')

torch.cuda.synchronize('cuda:0')
torch.cuda.synchronize('cuda:1')

time_spent = []
with torch.no_grad():
    for i in range(1000):
        start_time = time.time()
        with torch.cuda.stream(s1):
            _ = model1(inputs1)
        with torch.cuda.stream(s2):
            _ = model2(inputs2)
        # wait for both GPUs before stopping the timer
        torch.cuda.synchronize('cuda:0')
        torch.cuda.synchronize('cuda:1')
        if i > 100:  # skip the first iterations as warm-up
            time_spent.append(time.time() - start_time)
However, when I measure the average time per iteration, I do not see any speed improvement compared to running model1 and model2 sequentially on a single GPU. Am I missing something here? I am running the test on Ubuntu 18.04 with two RTX 2080 Tis. Many thanks!
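For reference, the sequential single-GPU baseline I compare against looks roughly like this (a minimal sketch, assuming both models fit on cuda:0 and the same warm-up/averaging as above):

import time
import torch

# Sequential baseline: both models on one GPU, run one after the other.
inputs0 = inputs.to('cuda:0')
model1 = model1.to('cuda:0')
model2 = model2.to('cuda:0')

torch.cuda.synchronize('cuda:0')
time_spent = []
with torch.no_grad():
    for i in range(1000):
        start_time = time.time()
        _ = model1(inputs0)
        _ = model2(inputs0)
        torch.cuda.synchronize('cuda:0')
        if i > 100:  # skip warm-up iterations
            time_spent.append(time.time() - start_time)
print('avg time per iteration: {:.4f} s'.format(sum(time_spent) / len(time_spent)))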