How multi-threading is managed within ddp?

sbelharbi · September 8, 2021, 12:30am

you are so right!!!
distributed turned off the multithreading and the c++ function was too late when it asks for the max nbr threads which will be 1.

– begin side question
side question: how to ask ddp to log into a specific file?
the dd logging is in terminal which makes catching warnings difficult.
there are a lot of messages but they are lost.
is there a way to ask ddp to write logs in a specific file?
or can we properly manipulate the instance of log here to do that? thanks
the first time i used ddp, it starts throwing logs in terminal which is unpractical for debug.
there should be a way to tell ddp to log into a file.
the doc1 and doc2 do not seem to cover this.
i didnt investigate further as other things have more priority.
thanks

github.com

pytorch/pytorch/blob/65e6194aeb3269a182cfe2c05c122159da12770f/torch/distributed/run.py#L325

    
      
          import torch
          from torch.distributed.argparse_util import check_env, env
          from torch.distributed.elastic.multiprocessing import Std
          from torch.distributed.elastic.multiprocessing.errors import record
          from torch.distributed.elastic.rendezvous.utils import _parse_rendezvous_config
          from torch.distributed.elastic.utils import macros
          from torch.distributed.elastic.utils.logging import get_logger
          from torch.distributed.launcher.api import LaunchConfig, elastic_launch
          
          

          
log = get_logger()
          
          

          
def get_args_parser() -> ArgumentParser:
              """Helper function parsing the command line options."""
          
          
    parser = ArgumentParser(description="Torch Distributed Elastic Training Launcher")
          
          
    #
              # Worker/node size related arguments.
              #

–end side question

so, now i configure export OMP_NUM_THREADS=32 before running and the runtime is back to 70ms with ddp+2gpus!!! this is cool!
thank you very much! this is a life saver!

also, i was getting this warning which is impossible to read because it is printed on terminal!

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to 
be 1 in default, to avoid your system being overloaded, please further tune 
the variable for optimal performance in your application as needed.
*****************************************

which explains everything!

also, i am using distributed.launch which is explains why i am getting this warning

The module torch.distributed.launch is deprecated and going to be removed
 in future.Migrate to torch.distributed.run

in the examples they provided, they use launch. i should probably switch to run, it seems more up to date.
also, they used launch in this under Launch utility. probably the doc needs to be updated probably in the next release. i am using pytorch 1.9.0.

again, thank you so much! this was very helpful!