Hi, I want to run two different networks that take the same input in parallel, each on its own GPU.
The main goal is inference speed-up, so I did:
```python
import time
import torch

inputs = torch.randn(1, 3, 256, 256)
inputs1 = inputs.to('cuda:0')
inputs2 = inputs.to('cuda:1')
model1 = model1.to('cuda:0')
model2 = model2.to('cuda:1')

# One stream per device, so each model launches on its own GPU's stream
s1 = torch.cuda.Stream(device='cuda:0')
s2 = torch.cuda.Stream(device='cuda:1')
torch.cuda.synchronize()

time_spent = []
with torch.no_grad():
    for i in range(1000):
        start_time = time.time()
        with torch.cuda.stream(s1):
            _ = model1(inputs1)
        with torch.cuda.stream(s2):
            _ = model2(inputs2)
        torch.cuda.synchronize()  # wait for both GPUs before stopping the clock
        if i > 100:  # discard warm-up iterations
            time_spent.append(time.time() - start_time)
```
However, when I measure the average time per iteration, I see no speed-up compared to running model1 and model2 sequentially on a single GPU. Am I missing something here? I am running the test on Ubuntu 18.04 with two RTX 2080 Ti cards. Many thanks!
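For reference, the sequential single-GPU baseline I am comparing against looks roughly like this. This is a scaled-down sketch (200 iterations instead of 1000), the `Conv2d` layers are placeholders standing in for the real model1/model2, and it falls back to CPU when CUDA is unavailable:

```python
import time
import torch
import torch.nn as nn

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

# Placeholder networks standing in for the real model1 / model2
model1 = nn.Conv2d(3, 16, 3, padding=1).to(device).eval()
model2 = nn.Conv2d(3, 16, 3, padding=1).to(device).eval()
inputs = torch.randn(1, 3, 256, 256, device=device)

time_spent = []
with torch.no_grad():
    for i in range(200):
        start_time = time.time()
        _ = model1(inputs)  # first model runs to completion...
        _ = model2(inputs)  # ...then the second, on the same device
        if device != 'cpu':
            torch.cuda.synchronize()  # wait for the GPU before stopping the clock
        if i > 20:  # discard warm-up iterations
            time_spent.append(time.time() - start_time)

print('avg time per iteration: %.6f s' % (sum(time_spent) / len(time_spent)))
```

The parallel version above is timed the same way, so the two averages should be directly comparable.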