Run two pre-trained models in parallel on the same GPU

I am trying to get outputs from two different pre-trained models on the same input, in parallel.
I have tried both threading and multiprocessing: with threading, the code actually gets slower; with multiprocessing, the functions responsible for running the pre-trained models never fire.

The code used is:

import time

from multiprocess import Process, set_start_method, Queue

# resnet18, densenet and loader are defined earlier (not shown)
output1parallel = None
output2parallel = None

def getOutput1(q):

  print("IM HERE 1\n")
  global loader, resnet18, output1parallel
  for i,batch in enumerate(loader):
    currentBatch = batch.cuda()
    resnet18 = resnet18.cuda()
    output1parallel = resnet18(currentBatch).cpu()
    del currentBatch
  q.put('hello')

def getOutput2(q):
  
  print("IM HERE 2\n")
  global loader, densenet, output2parallel

  for i,batch in enumerate(loader):
    currentBatch = batch.cuda()
    densenet = densenet.cuda()
    output2parallel = densenet(currentBatch).cpu()

    del currentBatch
  q.put('hello')

if __name__ == '__main__':
  set_start_method('spawn', force=True)
  densenet.share_memory()
  resnet18.share_memory()
  start = time.time()
  q = Queue()
  p1 = Process(target=getOutput1, args=(q,))
  p2 = Process(target=getOutput2, args=(q,))

  p1.start()
  p2.start()
  print(p1, p1.is_alive())
  print(p2, p2.is_alive())
  p1.join()
  p2.join()
  print("Time for parallel implementation: {}".format(time.time() - start))

Hey @Mamdouh_Aljoud, please use torch.multiprocessing instead. See examples here: Multiprocessing best practices — PyTorch 1.7.0 documentation
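A minimal sketch of that pattern, using small nn.Linear models as stand-ins for your resnet18/densenet (an assumption for illustration; the real models and DataLoader are just heavier versions of the same idea). The key fix is that module-level globals are not shared across spawned processes, so each worker must receive its model and input as arguments and return its output through the queue:

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp

# Stand-ins for the pre-trained resnet18/densenet (assumption: the real
# models behave the same way here, just with more parameters).
model1 = nn.Linear(8, 4)
model2 = nn.Linear(8, 4)
batch = torch.randn(16, 8)

def run_model(model, batch, q, tag):
    # The worker receives the model and data as arguments and sends the
    # result back through the queue. Writing to a global, as in the
    # original code, has no effect in the parent: spawned processes do
    # not share module-level state.
    with torch.no_grad():
        q.put((tag, model(batch).cpu()))

if __name__ == "__main__":
    # "spawn" is required when the workers touch CUDA; note that module-level
    # code above re-runs in each spawned child, which is harmless here.
    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    p1 = ctx.Process(target=run_model, args=(model1, batch, q, "resnet18"))
    p2 = ctx.Process(target=run_model, args=(model2, batch, q, "densenet"))
    p1.start()
    p2.start()
    # Drain the queue before join(): joining first can deadlock when the
    # queued tensors are large.
    results = dict(q.get() for _ in range(2))
    p1.join()
    p2.join()
    print(results["resnet18"].shape)  # torch.Size([16, 4])
```

Draining the queue before `join()` matters in practice: a child process cannot exit until everything it put on the queue has been consumed.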

Besides, I would recommend first trying multiple CUDA streams in the same process, one stream per model. See examples here: CUDA semantics — PyTorch 1.7.0 documentation
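A rough sketch of the single-process, two-stream approach (again with nn.Linear stand-ins for the two pre-trained models, which is an assumption; it falls back to plain sequential execution when no GPU is present):

```python
import torch
import torch.nn as nn

# Stand-ins for the two pre-trained models (assumption for illustration).
model1 = nn.Linear(8, 4)
model2 = nn.Linear(8, 4)

device = "cuda" if torch.cuda.is_available() else "cpu"
model1, model2 = model1.to(device), model2.to(device)
batch = torch.randn(16, 8, device=device)

if device == "cuda":
    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    # Make each side stream wait for the default stream, so the input
    # tensor is fully written before the models read it.
    s1.wait_stream(torch.cuda.current_stream())
    s2.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s1):
        out1 = model1(batch)
    with torch.cuda.stream(s2):
        out2 = model2(batch)
    # Wait for both streams before touching the outputs on the host.
    torch.cuda.synchronize()
else:
    # CPU fallback: no streams, just run the two models one after the other.
    with torch.no_grad():
        out1, out2 = model1(batch), model2(batch)

print(out1.shape, out2.shape)
```

Whether the two forward passes actually overlap depends on how much of the GPU a single model already saturates; for large models the streams may serialize anyway, so it is worth timing against the sequential version.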