B_model and C_model both take out_A as input, but they run independently.
As I understand it, PyTorch runs these lines serially, right? If B_model and C_model could run in parallel (e.g., with multi-threading), would that save a lot of time?
Very, very, very interested in this! Anyone?
I’m hoping there is a way to parallelize this on one GPU (which should generalize to multi-GPU and multi-node), but I lack the details to judge whether that’s possible.
Briefly, you need either multiple torch.cuda.stream contexts, or torch.jit.fork (in JIT-compiled code) if you also want to separate the CPU threads that enqueue CUDA operations. Unfortunately, the speedup on one GPU may be limited if GPU utilization in the affected code is already high or if the code fragments are small.
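For what it's worth, here is a rough sketch of the stream approach. The three nn.Linear layers are only hypothetical stand-ins for A_model, B_model and C_model (their sizes are made up), and I run it under no_grad to keep autograd out of the picture. If no GPU is present it simply falls back to the serial calls, so treat it as an illustration of the pattern rather than a tuned implementation.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-ins for the three models in the question.
A_model = nn.Linear(16, 32).to(device).eval()
B_model = nn.Linear(32, 8).to(device).eval()
C_model = nn.Linear(32, 4).to(device).eval()


@torch.no_grad()
def run_branches(x):
    out_A = A_model(x)
    if out_A.device.type != "cuda":
        # No GPU available: there are no CUDA streams, so run serially.
        return B_model(out_A), C_model(out_A)

    main = torch.cuda.current_stream()
    stream_B, stream_C = torch.cuda.Stream(), torch.cuda.Stream()

    # Both side streams must wait until out_A has been computed.
    stream_B.wait_stream(main)
    stream_C.wait_stream(main)

    # Kernels enqueued in different streams are allowed to overlap on the GPU.
    with torch.cuda.stream(stream_B):
        out_B = B_model(out_A)
    with torch.cuda.stream(stream_C):
        out_C = C_model(out_A)

    # Tell the caching allocator which other streams use each tensor.
    out_A.record_stream(stream_B)
    out_A.record_stream(stream_C)
    out_B.record_stream(main)
    out_C.record_stream(main)

    # The main stream must wait for both branches before using the results.
    main.wait_stream(stream_B)
    main.wait_stream(stream_C)
    return out_B, out_C


x = torch.randn(4, 16, device=device)
out_B, out_C = run_branches(x)
print(out_B.shape, out_C.shape)
```

To see whether this helps in your case, time it against the plain serial version and call torch.cuda.synchronize() before reading the clock. With layers this small I would not expect much overlap, since launch overhead on the CPU side tends to dominate.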