Hi all,
I am fairly new to multiprocessing and could not find information on how to run parts of a Python script on a single main process during distributed training, e.g. on a single node with 8 GPUs.
I launch as follows:
OMP_NUM_THREADS=12 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --standalone --nnodes=1 --nproc_per_node=8 my_python_script.py --my_parser_args …
Distributed training works well (for context, I use the HuggingFace Trainer with FairScale via --sharded_ddp zero_dp_2). However, it seems the whole script is executed as many times as there are devices, in this case 8 times; this is visible in that every print statement appears 8 times.
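For example (a minimal illustration, assuming nothing beyond the environment variables torchrun sets), this line at the top of the script produces 8 lines of output, one per spawned process:

```python
import os

# torchrun spawns one process per GPU and sets RANK/LOCAL_RANK for each,
# so this prints 8 times, once from each process
print(f"hello from rank {os.environ.get('RANK')}")
```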
To my understanding, ideally everything before training that does not require the GPUs (importing and pre-processing data, setting up hyper-parameters, and so on) should run only on a single main process, and only once training starts should execution be replicated across as many processes as there are GPUs.
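In other words, I imagine something like the following rank-gating pattern (a minimal sketch using torch.distributed, assuming torchrun has set the usual RANK/LOCAL_RANK/WORLD_SIZE environment variables; prepare_data is just a hypothetical placeholder for my setup code):

```python
import os
import torch
import torch.distributed as dist

# torchrun already sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

if dist.get_rank() == 0:
    prepare_data()  # hypothetical placeholder for the CPU-only setup work

# the other 7 processes wait here until rank 0 is done
dist.barrier()

# from here on, all 8 processes execute the training code
```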
What is the PyTorch way to locally disable this multi-processing, please?
In HuggingFace, there is a context manager, main_process_first, which lets a block of code run on the main process first, e.g. dataset pre-processing with the tokenizer.
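For reference, this is roughly how it is used in the HF example scripts (main_process_first is a method on TrainingArguments; raw_datasets and tokenize_function here are just illustrative names, not part of the API):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="out")

# rank 0 enters the block first; the other ranks wait, then run it
# afterwards (typically hitting the datasets cache written by rank 0)
with training_args.main_process_first(desc="dataset map pre-processing"):
    tokenized = raw_datasets.map(tokenize_function, batched=True)
```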
Thanks!