How to run some modules concurrently on a single GPU?

For example, there are two modules in my model:

class Net(nn.Module):
  def __init__(self):
    super().__init__()
    self.module1 = Spatial_Path()
    self.module2 = Context_Path()
  def forward(self, input_):
    output1 = self.module1(input_)
    output2 = self.module2(input_)
    return output1 + output2

Now I would like to run self.module1 and self.module2 concurrently to speed up the forward pass. How can I do that?


Given that you don’t explicitly send any tensor to the CPU in these modules, they will run asynchronously.
It then depends on how much work has to be done in each module, but the forward pass should return very quickly after queuing the jobs on the GPU. Everything should then run as fast as possible on the GPU.
Do you see very low GPU usage? How big are the ops you’re doing? Are you actually spending more time on the CPU side than the GPU side?
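If profiling shows that one module alone cannot saturate the GPU, you can also try issuing the two forward passes on separate CUDA streams. Here is a minimal sketch of that idea — note that `Spatial_Path` and `Context_Path` are not defined in the original post, so small `nn.Linear` layers stand in for them, and the code falls back to sequential execution when no GPU is present:

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-ins for Spatial_Path() and Context_Path(), whose
        # definitions are not shown in the original post.
        self.module1 = nn.Linear(16, 16)
        self.module2 = nn.Linear(16, 16)

    def forward(self, input_):
        if input_.is_cuda:
            s1 = torch.cuda.Stream()
            s2 = torch.cuda.Stream()
            # Make sure input_ is fully materialized before the
            # side streams start reading from it.
            torch.cuda.synchronize()
            with torch.cuda.stream(s1):
                output1 = self.module1(input_)
            with torch.cuda.stream(s2):
                output2 = self.module2(input_)
            # Wait for both streams before combining the results.
            torch.cuda.synchronize()
        else:
            # CPU fallback: plain sequential execution.
            output1 = self.module1(input_)
            output2 = self.module2(input_)
        return output1 + output2

net = Net()
x = torch.randn(4, 16)
if torch.cuda.is_available():
    net, x = net.cuda(), x.cuda()
out = net(x)
print(out.shape)  # torch.Size([4, 16])
```

Whether this actually helps depends on kernel sizes: if each module’s kernels already fill the GPU, the streams just serialize and you gain nothing, and the extra synchronization can even hurt. Profile first (e.g. with `torch.profiler` or `nvidia-smi`) before adding this complexity.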