Using streams doesn't seem to improve performance

I’m trying to run many small models in parallel. My first implementation, which just called each model in a loop, had very low GPU utilization, about 25%. After some research, it sounded like what I wanted was CUDA streams. So I created two streams, split my models into two groups, and ran each group wrapped in a with block. The code ran fine, but GPU utilization stayed low at about 25%.

Am I misunderstanding the purpose or function of streams?


There are a few things to be careful about when using streams (note that I’m not a specialist at all!). In particular, if you run any op on the default stream, it will sync all streams, so you need to make sure no such op is run. The NVIDIA Visual Profiler is a great tool for seeing what runs where and what runs in parallel.
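For illustration, here is a minimal sketch of how two groups of models might be launched on separate non-default streams, with the only synchronization placed before and after the parallel section. The helper name run_groups_on_streams, the nn.Linear stand-in models, and the tensor shapes are assumptions made for this example, not the original poster's code. It falls back to a plain loop when no GPU is available.

```python
import torch
import torch.nn as nn


def run_groups_on_streams(groups, x):
    """Run each group of models on its own CUDA stream (hypothetical helper).

    Falls back to a plain sequential loop when x is not on a CUDA device.
    """
    if not x.is_cuda:
        return [[model(x) for model in group] for group in groups]

    current = torch.cuda.current_stream()
    streams = [torch.cuda.Stream() for _ in groups]
    outputs = []
    for stream, group in zip(streams, groups):
        # x was produced on the current stream, so each side stream
        # waits for it before reading.
        stream.wait_stream(current)
        with torch.cuda.stream(stream):
            # Kernels launched here are queued on the side stream.
            outputs.append([model(x) for model in group])

    # Make the current stream wait for every side stream before the
    # results are used downstream.
    for stream in streams:
        current.wait_stream(stream)
    return outputs


device = "cuda" if torch.cuda.is_available() else "cpu"
# Stand-in models: four small linear layers split into two groups.
models = [nn.Linear(16, 4).to(device) for _ in range(4)]
groups = [models[:2], models[2:]]
x = torch.randn(8, 16, device=device)

with torch.no_grad():
    outputs = run_groups_on_streams(groups, x)
if x.is_cuda:
    torch.cuda.synchronize()
```

Whether the kernels actually overlap depends on their size and on how fast the CPU can launch them, so a profiler trace is still the way to confirm what runs in parallel.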

Also, for very small problems, you may be limited by how fast the CPU can queue work on the GPU, meaning that you are actually bound by the CPU code. To improve this, you would need to use a different process for each model.

I did try torch.multiprocessing, but then Python allocates a large amount of memory, which seems to be roughly proportional to the number of processes that torch.multiprocessing starts.

Yes, that is expected. Sharing CUDA between processes is quite tricky; I don’t think we support it.

Really? I started 6 processes and ~12 gigs of RAM was allocated.

In any case, is there a way to use threading so that I can use multiple processors while still only having a single process?


Yes, you can use Python’s built-in threading with PyTorch with no issue.
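As a rough sketch of the threading approach, the standard library's ThreadPoolExecutor can run one group of models per thread inside a single process. The names run_group, run_groups_in_threads, and make_model are hypothetical, and the "models" here are plain Python callables standing in for real PyTorch modules so that the example stays self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def run_group(group, x):
    """Call every model in one group on the same input, in order."""
    return [model(x) for model in group]


def run_groups_in_threads(groups, x):
    """Run each group in its own thread and collect results in group order."""
    with ThreadPoolExecutor(max_workers=len(groups)) as pool:
        futures = [pool.submit(run_group, group, x) for group in groups]
        return [future.result() for future in futures]


def make_model(scale):
    """Stand-in for a real model: multiplies its input by a constant."""
    return lambda x: x * scale


groups = [[make_model(1), make_model(2)], [make_model(3), make_model(4)]]
results = run_groups_in_threads(groups, 10)
print(results)  # [[10, 20], [30, 40]]
```

With real PyTorch modules, each worker could also enter its own torch.cuda.stream(...) context. Python-level code in the threads still runs under the GIL, so the benefit depends on how much time is spent inside PyTorch operations rather than in the Python loop.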