How to create new `CUcontext` for different threads of the same process

0xFFFFFFFF · February 26, 2019, 8:45am

I’ve noticed the, when multiple threads run GPU pytorch code, the operations happen in serial ( i.e. the speed of having 10 threads of the same process each perform a GPU task is the same as having 1 thread perform 10 tasks ). However, when I spanw multiple processes, there is a significant and linear speedup for upto 3 processes.

Why do you think this could be? I hypothesize: by default, same CUcontext is used among multiple threads of the same process, where as different processes use different CUcontext.
If my hypothesis in (1) is correct, then how can I manually create CUcontext for each thread in pytorch?

pietern · March 12, 2019, 5:18pm

Are you using different CUDA streams for every thread?

By default they will all use the same CUDA stream, which will serialize all operations.

See c10::Stream if you’re developing against master, or this one if you’re developing on 1.0.1.