Total number of processes and threads created using nn.distributed.parallel

This is in the context of training using the Python front end.

Where could I find some information about the total number of processes and threads when using the nn.distributed.parallel module?

If I have a simple neural network (e.g. MNIST) and I do distributed data parallelism with 1 process per GPU, with both training and eval going on and a dataloader with 1 worker, should I have only 3 processes per GPU: 1 main process (the training one) that spawns an eval process and a dataloader process? Then within the main process: a thread for scheduling work, a thread for forward, a thread for backward, a thread to deal with the eval process, a thread to deal with the dataloader, and a thread for the cache manager, for a total of 6 threads. When profiling I see several more threads than that. Is there any document where I can get that info?

Also, since BWD consumes what FWD produces, is there a way I can "merge" the FWD and BWD threads into a single thread? And is there a way to keep the caching allocator from deallocating objects when the number of objects (tensors, model) remains the same from iteration to iteration, so I can avoid the expensive mmap/munmap?

Thanks in advance.

In terms of the total number of processes, the num_workers argument you pass to the DataLoader class determines the number of subprocesses it uses (0 means the dataloader uses the main process). Here is some documentation: https://pytorch.org/docs/1.1.0/_modules/torch/utils/data/dataloader.html.
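As a quick illustration (assuming torch is installed; the dataset and batch size here are made up for the example):

```python
# Illustration of num_workers: 0 loads in the main process,
# >0 starts that many worker subprocesses when you iterate.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(8.0))

# num_workers=0: samples are loaded in the main process, no subprocess.
loader_main = DataLoader(dataset, batch_size=4, num_workers=0)

# num_workers=1: one worker subprocess is spawned once iteration starts.
loader_worker = DataLoader(dataset, batch_size=4, num_workers=1)

n_batches = len(list(loader_main))  # 8 samples / batch size 4 = 2 batches
print(n_batches)
```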

For the number of threads, this varies based on the communication backend you use (which is passed to init_process_group). For example the gloo backend uses 2 threads per device: https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/init.cpp#L565.

Thanks for the quick reply.

I am using the NCCL backend. I do not see how many threads are being passed to create the group at https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/ProcessGroupNCCL.cpp
Is there a knob to control that ?

I am using the dataloader with num_workers set to 1 (so the main process spawns a separate process?), based on the information at https://pytorch.org/docs/1.1.0/_modules/torch/utils/data/dataloader.html.
At each iteration the dataloader worker process is created and destroyed (if num_workers != 0), which has some overhead?
Can we keep those processes alive across iterations (with the count depending on how many samples within the batch you want to process concurrently) so we do not incur that overhead?

I am basically trying to prune the number of processes and threads. I understand this may restrict generality, but I am trying to speed up execution when I am CPU bound.

Thanks in advance.

The number of threads is currently not tunable by the user, but we’re considering making this possible in a future release.

Right, num_workers=1 would spawn a separate process. Here’s an issue tracking the discussion around keeping subprocesses alive across iterations (with a patch that should make this possible):

Thanks for the pointer to this discussion on the dataloader.
With respect to the number of threads/processes, I still don't fully understand all the other threads being generated.
Is it possible, for example, to use the same thread for backward and forward if they deal with the same model and batch, instead of having 2 threads that could be using different models and samples?
The NCCL process group also accepts a size option; I am not sure whether that also refers to the number of threads.
Are there any other hardcoded thread counts I could reduce? (set_num_threads(1)/set_num_interop_threads(1) does not prevent the creation of a bunch of threads, more than 6 per process that deals with each GPU.) I have very few cores available per GPU (~4), so I need to restrict the number of threads to what is necessary.
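For reference, a minimal sketch (assuming torch is installed) of what set_num_threads does control; it bounds the intra-op pool but, as observed above, does not cover NCCL, autograd, or dataloader threads:

```python
# Sketch (assumes torch is installed): set_num_threads bounds only the
# intra-op (OpenMP/native) pool; backend and autograd threads are created
# independently of it, which is why the total thread count can exceed it.
import torch

torch.set_num_threads(1)
n = torch.get_num_threads()
print(n)  # intra-op pool size, not the total thread count of the process
```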

Thanks, again.

I’m not sure of any way to coerce forward and backward into using the same thread.

That size actually refers to world_size, which is the total number of ranks in your job.

This might provide some more insight into tuning the number of threads: https://github.com/pytorch/pytorch/issues/16894. For example, the OMP_NUM_THREADS env var controls the number of OpenMP threads for CPU operations, and MKL_NUM_THREADS does the same for MKL.
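In a script these variables have to be set before the numerical libraries initialize, so export them in the shell or set them at the very top of the script; a sketch (the value 1 is just an example):

```python
# Sketch: thread-count env vars are read when the numerical libraries
# initialize, so set them before any import of torch/numpy.
import os

os.environ["OMP_NUM_THREADS"] = "1"   # OpenMP threads for CPU ops
os.environ["MKL_NUM_THREADS"] = "1"   # MKL internal threads
# import torch  # only after the variables above are set

print(os.environ["OMP_NUM_THREADS"])
```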

Thanks @osalpekar.
The OpenMP environment variables don't make a difference if torch.set_num_threads is already set.
In fact I use GOMP_CPU_AFFINITY to pin particular OpenMP threads to specific cores. I also played with OMP_DYNAMIC/MKL_DYNAMIC set to false.
I still do not understand, though, what torch.set_num_threads controls if I end up with 1 thread for FWD, a different thread for BWD, and several other threads besides.
I suspect there is a total amount of work and some "empirical" heuristic for how many threads to use for a given amount of work, and the env var overrides that heuristic.
I am just trying to find the code where that is prescribed/defined.
Are you aware of a document that describes the architecture in terms of threads/functionality?
Why would I get more than 6 threads if torch.set_num_threads is set to 1?

Regards.