Total number of processes and threads created using nn.distributed.parallel

In the context of training with the Python front end.

Where can I find information about the total number of processes and threads created when using the nn.distributed.parallel module?

If I have a simple neural network (e.g. MNIST) and I do distributed data parallelism with 1 process per GPU, with both training and eval going on and a dataloader with 1 worker, should I have only 3 processes per GPU: 1 main process (the training one) that spawns an eval process and a dataloader process?

Then, within the main process, I would expect: a thread for scheduling work, a thread for forward, a thread for backward, a thread to deal with the eval process, a thread to deal with the dataloader, and a thread for the cache manager. That is 6 threads. When profiling I see several more. Is there any document where I can get that info?

Also, if BWD is consuming what FWD is producing, is there a way I can “merge” the FWD and BWD threads into a single thread? Is there also a way to avoid deallocating objects from the cache allocator if the number of objects (tensors, model) remains the same from iteration to iteration, so I can avoid the expensive mmap/munmap?

Thanks in advance.

In terms of the total number of processes, the num_workers argument you pass to the DataLoader class determines the number of subprocesses it uses (0 means the dataloader runs in the main process). Here is some documentation:

For the number of threads, this varies based on the communication backend you use (which is passed to init_process_group). For example the gloo backend uses 2 threads per device:

Thanks for the quick reply.

I am using the NCCL backend. I do not see how many threads are passed when the group is created. at
Is there a knob to control that?

I am using a dataloader with num_workers set to 1 (so the main process spawns a separate process?).
Based on the information on
At each iteration the dataloader process is created and destroyed (if num_workers != 0), which has some overhead?
Can we keep the processes alive across iterations (as many as the number of samples within the batch you want to work on concurrently) so we do not incur that overhead?

I am basically trying to prune the number of processes and threads. I understand this may restrict generality, but I am trying to speed up execution when I am CPU bound.

Thanks in advance.

The number of threads is currently not tunable by the user, but we’re considering making this possible in a future release.

Right, num_workers=1 would spawn a separate process. Here’s an issue tracking the discussion around keeping subprocesses alive across iterations (with a patch that should make this possible):
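As a rough sketch of what keeping the workers alive can look like: this assumes a PyTorch release recent enough that DataLoader accepts a persistent_workers argument, which is not something stated in this thread. The helper names loader_kwargs and make_loader are made up for illustration. persistent_workers is only passed when num_workers > 0, since DataLoader rejects it otherwise.

```python
def loader_kwargs(num_workers: int, batch_size: int = 64) -> dict:
    """Build DataLoader keyword arguments (hypothetical helper).

    num_workers=0 loads data in the main process; num_workers>0 uses that
    many worker subprocesses.
    """
    kwargs = {"batch_size": batch_size, "num_workers": num_workers}
    if num_workers > 0:
        # Assumption: the installed PyTorch supports persistent_workers.
        # It asks the loader to keep worker processes alive between epochs
        # instead of recreating them each time a new iterator is made.
        kwargs["persistent_workers"] = True
    return kwargs


def make_loader(dataset, num_workers: int = 1, batch_size: int = 64):
    """Create a DataLoader for `dataset` using the kwargs above."""
    # Imported lazily so loader_kwargs can be used without PyTorch installed.
    from torch.utils.data import DataLoader

    return DataLoader(dataset, **loader_kwargs(num_workers, batch_size))
```

With num_workers=0 no extra process is created at all, which may be the cheaper option when only a few cores are available per GPU and loading is not the bottleneck.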

Thanks for the pointer to this discussion on the dataloader.
With respect to the number of threads/processes, I still do not understand where all the other threads come from.
Is it possible, for example, to use the same thread for backward and forward if they deal with the same model and batch, instead of having 2 threads that could in principle be working on different models and samples?
The NCCL process group also accepts a size option; I am not sure whether it also refers to the number of threads.
Are there any other hardcoded thread counts that I could reduce? (set_num_threads(1)/set_num_interop_threads(1) does not prevent a bunch of threads from being created, more than 6 in each process that deals with a GPU.) I have very few cores available per GPU (~4), so I need to restrict the number of threads to what is necessary.

Thanks, again.

I’m not sure of any way to coerce forward and backward into using the same thread.

That size actually refers to world_size, which is the total number of ranks in your job.

This might provide some more insight into tuning the number of threads: For example, the OMP_NUM_THREADS env var controls the number of OpenMP threads used for CPU operations, and MKL_NUM_THREADS does the same for MKL.
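A minimal sketch of applying those two variables from Python. The function name cap_cpu_threads is hypothetical. The environment variables should be set before the numerical libraries initialise their thread pools, so call this before anything else imports torch. The torch calls are optional and only cap the CPU compute pools; they are not expected to remove threads created for other purposes.

```python
import os


def cap_cpu_threads(n: int = 1) -> dict:
    """Cap OpenMP/MKL thread pools to `n` threads (hypothetical helper).

    Returns a dict describing what was applied.
    """
    applied = {}
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS"):
        os.environ[var] = str(n)
        applied[var] = str(n)

    try:
        import torch  # optional: skipped if PyTorch is not installed
    except ImportError:
        return applied

    # Intra-op parallelism (threads used inside a single CPU operator).
    torch.set_num_threads(n)
    applied["torch_num_threads"] = torch.get_num_threads()
    try:
        # Inter-op parallelism; PyTorch raises RuntimeError if parallel
        # work has already started, so treat this as best effort.
        torch.set_num_interop_threads(n)
    except RuntimeError:
        pass
    return applied
```

Calling cap_cpu_threads(1) at the very top of the training script, before any other import, is the intended usage.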

Thanks @osalpekar.
OpenMP environment variables don't make a difference if torch.set_num_threads is already set.
In fact, I use GOMP_CPU_AFFINITY to pin a particular set of OpenMP threads to specific cores. I also tried setting OMP_DYNAMIC/MKL_DYNAMIC to false.
I still do not understand, though, what torch.set_num_threads controls if I end up having 1 thread for FWD and 1 different thread for BWD, plus several other threads.
My guess is that there is a total amount of work and some “empirical” rule for how many threads to use for a given amount of work, and that the env var overrides that heuristic.
I am just trying to find the code where that is prescribed/defined.
Are you aware of a document that describes the architecture in terms of threads and their functionality?
Why would I get more than 6 threads if I have torch.set_num_threads set to 1?
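For anyone who wants to check the thread count directly rather than through a profiler, here is a small sketch that lists the OS-level threads of a process. It assumes Linux with /proc mounted and uses only the standard library. The function name native_threads is made up for illustration and is not part of PyTorch. Thread names come from the kernel's comm field, which may or may not be informative depending on whether the library names its threads.

```python
import os
from pathlib import Path


def native_threads(pid=None):
    """Return [(tid, name), ...] for every OS thread of `pid`.

    Defaults to the current process. Returns None when /proc is not
    available (e.g. on non-Linux systems) or the process does not exist.
    """
    pid = os.getpid() if pid is None else pid
    task_dir = Path("/proc") / str(pid) / "task"
    try:
        tids = sorted(int(entry.name) for entry in task_dir.iterdir())
    except OSError:
        return None

    threads = []
    for tid in tids:
        try:
            name = (task_dir / str(tid) / "comm").read_text().strip()
        except OSError:
            continue  # the thread exited while we were listing
        threads.append((tid, name))
    return threads
```

Calling native_threads() once right after importing torch, once after init_process_group, and once after the first forward/backward pass should show at which step each group of threads appears.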