I want to run forward() in parallel with the same inputs but different weights, then merge the outputs once all of the forward() calls are done. How can I do this in PyTorch? It looks like I might be able to use CUDA streams: https://pytorch.org/docs/stable/notes/cuda.html#cuda-streams But I'm not sure if this is the right way to go. Is there a more efficient way to do this?
Thank you all.
You could use streams, but you would have to synchronize them properly to avoid race conditions.
Depending on the actual workload on the device, the speedup might not be large, e.g. if the first model already keeps the device busy, there is little idle capacity left for the other streams to use.
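For reference, here is a minimal sketch of what the stream-based version with explicit synchronization could look like. The helper name `forward_on_streams`, the `nn.Linear` model, and the way the weights are changed (scaling by `i + 1`) are placeholders I made up for illustration, not something from your code. It falls back to a plain loop when no GPU is available.

```python
import copy
import torch
import torch.nn as nn

def forward_on_streams(models, x):
    """Run each model on its own CUDA stream, then merge the outputs."""
    if not torch.cuda.is_available():
        # No GPU: fall back to a plain loop so the sketch still runs.
        with torch.no_grad():
            return torch.stack([m(x) for m in models])

    current = torch.cuda.current_stream()
    streams = [torch.cuda.Stream() for _ in models]
    outputs = []
    with torch.no_grad():
        for model, stream in zip(models, streams):
            # The side stream must not start before x is ready on the current stream.
            stream.wait_stream(current)
            with torch.cuda.stream(stream):
                outputs.append(model(x))
        for out, stream in zip(outputs, streams):
            # The merge below runs on the current stream, so it has to wait
            # for every side stream to finish its forward pass.
            current.wait_stream(stream)
            # Tell the caching allocator that this tensor is now used on `current`.
            out.record_stream(current)
    return torch.stack(outputs)

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)
base = nn.Linear(16, 4).to(device)
models = []
for i in range(10):
    m = copy.deepcopy(base)
    with torch.no_grad():
        m.weight.mul_(i + 1)  # hypothetical weight change, for illustration only
    models.append(m)

x = torch.randn(8, 16, device=device)
merged = forward_on_streams(models, x)  # shape: (10, 8, 4)
```

The two `wait_stream` calls are the synchronization points: without the first one a side stream could read `x` before it is ready, and without the second one `torch.stack` could read outputs that have not been computed yet.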
Thank you for the reply.
I tested streams, and the speedup does not look significant. Say I want to change the weights in 10 different ways and run forward() 10 times on the same inputs as fast as possible. What would you recommend to implement this?
The simplest approach would be to run the forward passes in a loop and to avoid any synchronizations inside it, such as printing the output.
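Something along these lines, as a rough sketch. The `forward_all` helper, the `nn.Linear` model, and the weight scaling are assumptions for the sake of the example, so swap in your own model and weight changes.

```python
import copy
import torch
import torch.nn as nn

def forward_all(models, x):
    """Run every model on the same input and merge the outputs once at the end."""
    outputs = []
    with torch.no_grad():
        for model in models:
            # CUDA kernels are launched asynchronously, so the loop can enqueue
            # the next forward pass while the GPU is still busy. Avoid
            # print(out), out.item(), or out.cpu() in here, since each of
            # them forces a synchronization.
            outputs.append(model(x))
    return torch.stack(outputs)

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)
base = nn.Linear(16, 4).to(device)
models = []
for i in range(10):
    m = copy.deepcopy(base)
    with torch.no_grad():
        m.weight.mul_(i + 1)  # hypothetical weight change, for illustration only
    models.append(m)

x = torch.randn(8, 16, device=device)
merged = forward_all(models, x)  # shape: (10, 8, 4)
result = merged.cpu()            # single synchronization, after all passes are queued
```

The only host-device synchronization is the final `.cpu()` call, so the GPU is never left waiting on the Python loop between forward passes.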