Multiprocessing vs. Nvidia MPS for parallel training on a single GPU

I’m interested in training multiple instances of a neural network model in parallel on a single GPU. Of course, this is only relevant for small models that don’t utilize the GPU well on their own.
According to this, PyTorch’s multiprocessing package allows parallelizing CUDA code.

I think that, with a slight modification of this example code, I managed to do what I wanted (train several models instead of training a single model using the Hogwild algorithm), and it worked pretty well. I managed to train 10 models in less than 150% of the time it took to train a single one.

Is this functionality related in any way to Nvidia’s MPS capability? I suspect not, because it doesn’t seem to involve anything related to starting an MPS server from the system, setting the GPU to EXCLUSIVE_PROCESS mode, and so on.

I understand that MPS is the way in which Nvidia supports CUDA multithreading/multiprocessing. Otherwise, it’s not really possible to tell the GPU to do things in parallel.
Is this true? If it is, then what exactly does multiprocessing do, and how does it work so well on the GPU without using MPS? :slight_smile:

Can anyone point me to an example that demonstrates the use of MPS with PyTorch? I couldn’t really find one. And strangely enough, the above 28-page Nvidia guide on MPS doesn’t include any example in PyTorch or any other leading framework.



cc @ngimel @mcarilli for NVidia MPS

“I understand that MPS is the way in which Nvidia supports CUDA multithreading/multiprocessing.”

Hmm, we need to be more specific. Each process receives its own CUDA context on each device used by the process. Per-device contexts are shared by all CPU threads within the process. Any CPU thread in the process may submit work to any CUDA stream (the kernel launch and stream APIs are thread safe), and the work may run concurrently with work submitted from other CPU threads. And of course, each kernel may use thousands of GPU threads.

By default (without MPS), each device runs kernels from only one context (process) at a time. If several processes target the same device, their kernels can’t run concurrently, and the GPU will context-switch between processes. MPS multiplexes kernels from different processes so kernels from any thread of any process targeting that device CAN run concurrently (I’m not sure how MPS works at a low level, but it works).

MPS is application-agnostic. After starting the MPS daemon in your shell:
nvidia-cuda-mps-control -d
all processes (Python or otherwise) that use the device have their CUDA calls multiplexed so they can run concurrently. You shouldn’t need to do anything PyTorch-specific: start the MPS daemon in the background, then launch your PyTorch processes targeting the same device.
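If you would rather drive the daemon from Python than from the shell, here is a rough sketch using only the standard library. The helper names (mps_available, start_mps, stop_mps, mps_daemon) are made up for illustration. The only real commands involved are nvidia-cuda-mps-control -d to start the daemon and sending quit to nvidia-cuda-mps-control on stdin to stop it. It assumes the binary is on your PATH and that you have permission to start the daemon.

```python
import shutil
import subprocess
from contextlib import contextmanager

# Name of the MPS control binary shipped with the NVIDIA driver.
MPS_CONTROL = "nvidia-cuda-mps-control"


def mps_available() -> bool:
    """Return True if the MPS control binary can be found on PATH."""
    return shutil.which(MPS_CONTROL) is not None


def start_mps() -> None:
    """Start the MPS control daemon in the background (-d)."""
    subprocess.run([MPS_CONTROL, "-d"], check=True)


def stop_mps() -> None:
    """Ask the running daemon to shut down by sending it 'quit' on stdin."""
    subprocess.run([MPS_CONTROL], input="quit\n", text=True, check=True)


@contextmanager
def mps_daemon():
    """Hypothetical helper: keep MPS running for the duration of a block."""
    start_mps()
    try:
        yield
    finally:
        stop_mps()
```

Usage would be something like "with mps_daemon(): ..." around the code that launches your training processes. Start the CUDA processes only after the daemon is up, otherwise they will not go through MPS.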

One thing I don’t know is whether NCCL allreduces in PyTorch can handle the case where data from all processes is actually on one GPU. I’ve never seen it tried. It sounds like your case doesn’t need inter-process NCCL comms, though.

MPS has been around for years, and works on any recent GPU generation. It is NOT the same thing as “multi-instance GPU” or MIG, which was introduced with the Ampere generation. (I think MIG sandboxes client processes more aggressively than MPS, providing better per-process fault isolation among other things. MPS should be fine for your case.)


Thanks for your reply!

One thing I don’t understand, then, is how this example can in practice parallelize training (on a single GPU) without MPS, even though it doesn’t use different CUDA streams.
Does it mean the gain just comes, at a high level, from better utilization of “idle” times, when the GPU is not busy and can now be fed commands from other threads/processes? If that’s the case, it’s surprising that the improvement is almost linear in the number of processes.

So it means that without MPS, work submitted from different CPU processes cannot run concurrently on the same GPU, but work submitted from different threads can (via streams)?
If I understand correctly, the example code above uses processes?

Without MPS, I don’t see how this example could run work from different processes concurrently. It’s possible the model is so small, and each process’s individual utilization is so low, that you observe a speedup with multiple processes even if they aren’t running concurrently. Hard to tell without profiling.

“work submitted from different CPU processes cannot run concurrently on the same GPU, but work submitted from different threads can (via streams)?”

Streams are free-floating from threads. Any thread may submit work to any stream.

  1. If all threads submit work to the same stream, those kernels will be serialized, even if the threads are in the same process. However, no context switches will be required.
  2. If threads in the same process submit work to different streams, the kernels may run with true concurrency (overlapping).
  3. If threads in different processes submit work to the same device, without MPS the kernels cannot run concurrently regardless of which streams are used, because they are in separate contexts. MPS allows processes to submit work to the device without a context switch. I believe (but I’m not sure) kernels may run with true concurrency (i.e., overlapping) even if each process uses its default stream. But even if the kernels from different processes can’t truly overlap, MPS is still beneficial because it increases the density of work submission by avoiding expensive context switches.

PyTorch is not compiled to use per-thread default streams. By default, all threads submit work to a shared default stream (called the “legacy” default stream in some CUDA docs). Therefore, unless you manually set stream contexts in each thread, case 1 applies.
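To make case 2 concrete, here is a minimal sketch of what “manually setting a stream context in each thread” could look like. The function names (worker, run_threads) and the toy matmul workload are placeholders, not anything from the example being discussed. It falls back to the CPU when no GPU is present, in which case the stream part is skipped. Whether kernels on different streams actually overlap still depends on how much of the GPU each kernel occupies.

```python
import threading
from contextlib import nullcontext

import torch


def worker(index, results, device, steps=50):
    """Run a toy workload on a stream owned by this thread."""
    use_cuda = device.type == "cuda"
    # One stream per thread, so work is not serialized on the shared default stream.
    stream = torch.cuda.Stream(device=device) if use_cuda else None
    context = torch.cuda.stream(stream) if use_cuda else nullcontext()
    with context:
        x = torch.randn(256, 256, device=device)
        for _ in range(steps):
            x = torch.tanh(x @ x.t() / 256)
        # .item() waits for the work queued on the current stream.
        results[index] = x.abs().mean().item()


def run_threads(num_threads=4, device=None):
    """Start num_threads workers in one process and collect their results."""
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    results = [None] * num_threads
    threads = [
        threading.Thread(target=worker, args=(i, results, device))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_threads())
```

Because all threads live in one process, they share a single CUDA context, so no MPS is needed for this variant.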

Not only does it work concurrently (in the sense that running the example with 10 processes vs. 1 results in much less than a 10x slowdown), but in this case MPS barely helps any further.
I think I’ve convinced myself by now that I know how to start and stop an MPS server. I modified this code to use a simpler network (an MLP) and to keep all data in memory instead of using a dataloader, and in this case MPS showed some improvement, but still far from linear. That’s how I knew that MPS works.

But okay, I guess this doesn’t explain much without profiling, like you said. Maybe this example is too complex. Maybe it has built-in idle GPU periods that allow the speedup without MPS, and MPS can’t help further because utilization is so low that there is little CUDA computation left to overlap.

That’s why it would be very helpful to see a very minimal PyTorch example that demonstrates good MPS usage, i.e., if you set num_processes=1, the run takes X seconds, and when you change it to num_processes=10, the runtime stays at about X seconds (plus a small overhead), whereas without MPS it would take 10X.
Maybe you could provide a clue on how to write such a minimal example? I.e., should I use torch.multiprocessing as in the example above, or something else for dispatching processes to the GPU?
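For what it’s worth, a benchmark of that shape could be sketched roughly as below. This is only a starting point under stated assumptions: the MLP, the synthetic in-memory data, the step count, and the names train_worker and timed_run are all invented for illustration. torch.multiprocessing.spawn is used as one possible launcher. The wall time also includes process startup and CUDA context creation, so use enough steps for the training loop to dominate.

```python
import time

import torch
import torch.multiprocessing as mp
from torch import nn


def train_worker(rank, steps, device_name):
    """Train one small MLP on synthetic in-memory data; return the last loss."""
    torch.manual_seed(rank)
    device = torch.device(device_name)
    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1)).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(1024, 32, device=device)
    y = x.sum(dim=1, keepdim=True)
    for _ in range(steps):  # assumes steps >= 1
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
    # .item() also waits for the queued GPU work to finish.
    return loss.item()


def timed_run(num_processes, steps, device_name):
    """Wall-clock time to train num_processes independent models."""
    start = time.perf_counter()
    # spawn calls train_worker(i, steps, device_name) in each child process.
    mp.spawn(train_worker, args=(steps, device_name), nprocs=num_processes, join=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    device_name = "cuda" if torch.cuda.is_available() else "cpu"
    for n in (1, 4):  # raise to 10 to match the experiment described above
        print(f"{n} process(es): {timed_run(n, 200, device_name):.2f} s")
```

Run it once with the MPS daemon stopped and once with it running, and compare how the time grows with the number of processes in each case. Profiling is still the only way to confirm whether kernels from different processes really overlap.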

That’s interesting. So, according to this (case 3), I understand that this example may not be optimal for execution with MPS. It might be beneficial to modify the example further and make sure that each process does its CUDA work on a different stream, because I don’t think this is the case now. I will try it out. Thanks!

I’m interested in training multiple neural network models on a single GPU as well, for example, running MobileNet and ResNet on a single GPU at the same time. Is there any example code showing how I could implement that in PyTorch, both with MPS and without it (naively running the processes side by side)?

I’d appreciate any help :smiley:
