Understanding CUDAStream Usage in PyTorch

Hi – I’m trying to understand how the CUDA stream handling implemented by PyTorch (CUDAStream.cpp) affects stream usage for concurrent model training tasks. For two models created by one process (initGlobalStreamState in CUDAStream.cpp is called only once), both running on gpu:0, nvprof shows all kernels being issued on the default stream of gpu:0. However, if I create these models in separate processes, each process has its own CUDA context, and kernels are issued to different streams. I'm wondering whether sharing a CUDA context changes the way CUDA kernels (from two different models running simultaneously) are dispatched onto the GPU.
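For reference, here's a minimal sketch of the single-process setup I'm describing (the model definitions and shapes are just placeholders), including what I'd expect to be the way to force the two models onto separate streams within one context:

```python
import torch

# Placeholder models -- both live on gpu:0 within a single process / CUDA context.
model_a = torch.nn.Linear(1024, 1024).cuda()
model_b = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

# Case 1: no explicit streams. In my nvprof trace, both models' kernels
# show up serialized on the default stream of gpu:0.
out_a = model_a(x)
out_b = model_b(x)

# Case 2: explicit side streams from the pool. Each model's kernels are
# enqueued on its own stream, still sharing one CUDA context.
stream_a = torch.cuda.Stream()
stream_b = torch.cuda.Stream()
with torch.cuda.stream(stream_a):
    out_a = model_a(x)
with torch.cuda.stream(stream_b):
    out_b = model_b(x)
torch.cuda.synchronize()
```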

Thanks!