I’ve noticed that when multiple threads run GPU PyTorch code, the operations happen serially (i.e. the time for 10 threads of the same process to each perform one GPU task is the same as for 1 thread to perform 10 tasks). However, when I spawn multiple processes, there is a significant, roughly linear speedup for up to 3 processes.
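A minimal sketch of the kind of timing experiment behind this observation (sizes, iteration counts, and function names are illustrative; it falls back to CPU when no GPU is present):

```python
import time
import threading
import multiprocessing as mp

import torch


def gpu_task(n_iters=20, size=512):
    # Use the GPU if available, otherwise fall back to CPU so the sketch runs anywhere.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(size, size, device=device)
    for _ in range(n_iters):
        x = x @ x
        x = x / x.norm()  # renormalize so values stay finite
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued kernels before the timer stops


def run_threads(n):
    workers = [threading.Thread(target=gpu_task) for _ in range(n)]
    t0 = time.perf_counter()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return time.perf_counter() - t0


def run_processes(n):
    workers = [mp.Process(target=gpu_task) for _ in range(n)]
    t0 = time.perf_counter()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return time.perf_counter() - t0


if __name__ == "__main__":
    # CUDA requires the "spawn" start method when using multiprocessing.
    mp.set_start_method("spawn", force=True)
    print(f"3 threads:   {run_threads(3):.2f}s")
    print(f"3 processes: {run_processes(3):.2f}s")
```

With this setup, the per-thread wall time grows with the number of threads, while the per-process wall time stays roughly flat up to a few processes.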
1. Why do you think this could be? My hypothesis: by default, the same `CUcontext` is shared among all threads of a process, whereas different processes each use their own `CUcontext`.
2. If my hypothesis in (1) is correct, how can I manually create a `CUcontext` for each thread in PyTorch?