How to create new `CUcontext` for different threads of the same process

I’ve noticed that, when multiple threads run GPU PyTorch code, the operations happen in serial (i.e. the speed of having 10 threads of the same process each perform a GPU task is the same as having 1 thread perform 10 tasks). However, when I spawn multiple processes, there is a significant and linear speedup for up to 3 processes.

  1. Why do you think this could be? I hypothesize: by default, the same `CUcontext` is used among multiple threads of the same process, whereas different processes use different `CUcontext`s.

  2. If my hypothesis in (1) is correct, then how can I manually create a `CUcontext` for each thread in PyTorch?

Are you using different CUDA streams for every thread?

By default they will all use the same CUDA stream, which will serialize all operations.

See c10::Stream if you’re developing against master, or this one if you’re developing on 1.0.1.
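From the Python side, a minimal sketch of the per-thread-stream idea looks like the following. It uses the public `torch.cuda.Stream` / `torch.cuda.stream` API (available in recent releases; the exact API surface in 1.0.1 may differ), and the matrix size and thread count are arbitrary choices for illustration:

```python
import threading
import torch

def worker(device: str = "cuda:0") -> None:
    # Each thread creates its own stream, so the kernels it launches
    # can overlap with kernels launched by other threads instead of
    # being serialized on the shared default stream.
    if not torch.cuda.is_available():
        return  # nothing to demonstrate on a CPU-only machine
    stream = torch.cuda.Stream(device=device)
    with torch.cuda.stream(stream):
        a = torch.randn(1024, 1024, device=device)
        b = a @ a  # launched on this thread's private stream
    # Wait for this stream's work before the result is consumed
    # elsewhere (e.g. on the default stream).
    stream.synchronize()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Note that per-thread streams let kernel launches overlap, but whether you see a wall-clock speedup still depends on whether a single kernel already saturates the GPU; small kernels benefit the most.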