You can just call these functions one after the other and they will run in parallel. The CUDA API is asynchronous, so the kernels from both networks get queued and overlap automatically.
It seems that they actually run sequentially: GPU usage goes from high to low for one network, then the other.
I could confirm this more precisely if there is a profiling tool for PyTorch on the GPU.
In that case, it means that you have some synchronization points in your model.
You don’t want to do any CPU/GPU op, which means no copies, printing, or .item() calls on GPU data.
In particular, you should send all the data to the GPU, then forward both nets, then use the outputs of both nets.
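The ordering above can be sketched as follows (a minimal example, assuming PyTorch; the two `nn.Linear` nets and the tensor sizes are placeholders for your actual models and data):

```python
import torch
import torch.nn as nn

# Fall back to CPU so the sketch runs anywhere; on a CUDA machine this is "cuda".
device = "cuda" if torch.cuda.is_available() else "cpu"

net1 = nn.Linear(16, 8).to(device)
net2 = nn.Linear(16, 8).to(device)

data = torch.randn(4, 16)

# 1) Send all inputs to the GPU up front.
data = data.to(device)

# 2) Forward both nets back-to-back.
#    No print(), .item(), or .cpu() in between -- any of those would force a
#    sync after the first forward and serialize the two networks.
out1 = net1(data)
out2 = net2(data)

# 3) Only now consume the outputs (this is where a sync may happen).
combined = out1 + out2
print(combined.shape)
```

The key point is that steps 1 and 2 only enqueue work on the GPU; nothing blocks until step 3 actually reads the results.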
You can use the NVIDIA Visual Profiler to check this in more detail, but it is quite extensive and might be too complex for what you need.
A simple trick to check where the sync happens is to time how long a torch.cuda.synchronize() call takes to return: just after the forward, it should take a bit of time because all the network’s ops are still running. If you do a .item() on a GPU tensor and then time the synchronize, it will return instantly, because the .item() already forced a full sync.
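The timing trick looks roughly like this (a sketch that only does anything on a CUDA machine; the layer size is an arbitrary placeholder, chosen large enough that the forward takes measurable time):

```python
import time
import torch
import torch.nn as nn

if torch.cuda.is_available():
    net = nn.Linear(4096, 4096).cuda()
    x = torch.randn(64, 4096, device="cuda")
    torch.cuda.synchronize()  # start from an idle GPU

    # Case 1: synchronize right after the forward -- the queued kernels are
    # still running, so this call takes a measurable amount of time.
    out = net(x)
    t0 = time.perf_counter()
    torch.cuda.synchronize()
    print(f"sync after forward: {time.perf_counter() - t0:.6f}s")

    # Case 2: .item() already blocked until the GPU finished, so a
    # synchronize right after it returns almost instantly.
    out = net(x)
    _ = out.sum().item()
    t0 = time.perf_counter()
    torch.cuda.synchronize()
    print(f"sync after .item(): {time.perf_counter() - t0:.6f}s")
```

If the second number is near zero while the first is not, the .item() call is your hidden synchronization point.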