Multiprocessing with multiple GPUs

Hi, I currently have a model that needs to run the same forward pass multiple times at test time (not training) at a certain node and average the results, similar to trajectory sampling in Monte Carlo tree search in RL. My pseudocode on a single GPU would be:

stats = []
for _ in range(num):
    stats.append(model(node_input))
result = sum(stats) / num
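Since this is inference only, I believe the repeats could in principle be batched into a single forward call before reaching for multiprocessing at all. A toy sketch of the idea below, where `model` and `averaged_forward` are just stand-ins (not real PyTorch code; the comments show what the tensor version might look like):

```python
import random

def model(x):
    # Toy stand-in for a stochastic forward pass: input plus Gaussian noise.
    return x + random.gauss(0, 0.1)

def averaged_forward(node_input, num):
    # With PyTorch tensors, this whole loop could become one batched call:
    #   batch = node_input.unsqueeze(0).expand(num, *node_input.shape)
    #   return model(batch).mean(dim=0)
    stats = [model(node_input) for _ in range(num)]
    return sum(stats) / num
```

Whether this is valid depends on the model: per-sample stochasticity (e.g. dropout) is usually independent across batch elements, but layers like batch norm in training mode would behave differently on a batch.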

However, the above loop does not seem efficient, so I would like to use multiprocessing to make full use of multiple GPUs. My design would be to spawn multiple processes and dispatch each one to a CPU core and then to a GPU whenever one becomes available. I would appreciate any PyTorch resources, example repos, or advice on combining multiprocessing with multiple GPUs.
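For the process-per-GPU idea, a rough sketch of the dispatch pattern using only the standard library `multiprocessing` module: each worker takes one device id from a queue at startup and keeps it for its lifetime, and a pool farms the samples out. The `forward` function and `NUM_GPUS` are stand-ins; with real models you would use `torch.multiprocessing` instead and load the model onto `f"cuda:{DEVICE}"` inside the initializer.

```python
import multiprocessing as mp

NUM_GPUS = 2  # assumed device count; with PyTorch, torch.cuda.device_count()

def init_worker(gpu_queue):
    # Each worker process pops one device id at startup and keeps it.
    global DEVICE
    DEVICE = gpu_queue.get()
    # With PyTorch you would build the model here, once per process, e.g.:
    #   global MODEL; MODEL = load_model().to(f"cuda:{DEVICE}")

def forward(node_input):
    # Toy stand-in for MODEL(node_input) running on this worker's device.
    return node_input

def run_pool(node_input, num):
    # Queue of free device ids; each of the NUM_GPUS workers claims one.
    gpu_queue = mp.Manager().Queue()
    for gpu_id in range(NUM_GPUS):
        gpu_queue.put(gpu_id)
    with mp.Pool(processes=NUM_GPUS, initializer=init_worker,
                 initargs=(gpu_queue,)) as pool:
        stats = pool.map(forward, [node_input] * num)
    return sum(stats) / len(stats)

if __name__ == "__main__":
    print(run_pool(1.0, 8))
```

One caveat I am aware of: CUDA does not survive `fork`, so with actual GPU tensors the `spawn` start method (`mp.get_context("spawn")` or `torch.multiprocessing`) would be needed, and the model should be loaded inside each worker rather than pickled across.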