Switching between multi-processing and main process

Hi all,

I am fairly new to multi-processing and could not find information on running parts of a Python script in a single main process as part of distributed training, e.g. on a single node with 8 GPUs.

I launch as follows:
OMP_NUM_THREADS=12 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --standalone --nnodes=1 --nproc_per_node=8 my_python_script.py --my_parser_args …

Distributed training works well (for context, I use the HuggingFace Trainer with FairScale --sharded_ddp zero_dp_2). However, it seems that the whole script is executed as many times as there are devices, in this case 8 times. This can be seen from the fact that all prints, for example, are replicated.
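
As far as I can tell, this is because torchrun starts one copy of the script per GPU and identifies each copy through environment variables; a minimal sketch of what each replica sees (assuming the standard RANK / LOCAL_RANK / WORLD_SIZE variables that torchrun sets):

import os

# torchrun launches one copy of the script per GPU and sets these variables
# in each process, so every replica can tell which one it is.
rank = int(os.environ.get("RANK", "0"))              # global rank, here 0..7
local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # rank within this node
world_size = int(os.environ.get("WORLD_SIZE", "1"))  # total number of processes

if rank == 0:
    print(f"only rank 0 of {world_size} prints this")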

To my understanding, everything in the script before training that doesn't require GPUs, e.g. loading data, setting up hyper-parameters and so on, should ideally run in a single main process, and only once training starts should the work be replicated across as many processes as there are GPUs in use.

What is the PyTorch way to locally disable multi-processing, please?
In HuggingFace, there is a “main_process_first” context manager which lets a block of code, e.g. dataset pre-processing with the tokenizer, run on the main process first.
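
To make it concrete, something like the following is roughly what I have in mind (just a sketch of the pattern using torch.distributed, based on my understanding of how main_process_first behaves, not taken from the HF source):

import os
from contextlib import contextmanager

import torch.distributed as dist

@contextmanager
def main_process_first_sketch():
    # Rank 0 runs the block while the other ranks wait at a barrier; once
    # rank 0 is done, the barrier is released and the other ranks run the
    # same block (typically reusing whatever rank 0 cached to disk).
    is_main = int(os.environ.get("RANK", "0")) == 0
    if dist.is_available() and dist.is_initialized():
        try:
            if not is_main:
                dist.barrier()   # non-main ranks wait for rank 0 to finish first
            yield
        finally:
            if is_main:
                dist.barrier()   # rank 0 releases the waiting ranks
    else:
        yield                    # single-process fallback

# hypothetical usage for GPU-free setup work:
# with main_process_first_sketch():
#     dataset = load_and_tokenize(...)  # hypothetical helper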

Thanks!

No, most likely this wouldn’t be ideal, as you might run into the Python GIL, which would then slow down your entire code. The single-process, multiple-devices approach is what e.g. nn.DataParallel uses and is thus not recommended.
Since your workflow already seems to work fine, you should get the best performance by using a single process per device.
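
If the duplicated prints are the main annoyance, a common workaround is to keep one process per device and simply silence print on every rank except rank 0 (just a sketch of the pattern, not an official PyTorch API):

import builtins
import os

def print_on_rank0_only():
    # Keep one process per device, but replace print with a no-op on every
    # rank except global rank 0, so the console shows each message once.
    if int(os.environ.get("RANK", "0")) != 0:
        builtins.print = lambda *args, **kwargs: None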

Thanks for your answer. I will then continue this way, since it seems intended that all operations, including the pre-processing and setup stages, are replicated as many times as there are GPU devices used in distributed training.