How to know nproc_per_node in Python code

Hi community,
How can we access the nproc_per_node parameter inside the training script code?

for example, running the following:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
           --nnodes=2 --node_rank=0 --master_addr="192.168.1.1"
           --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
           and all other arguments of your training script)

How can we read nproc_per_node and master_addr inside YOUR_TRAINING_SCRIPT.py?

Thank you!

Hello! You would need to duplicate them and pass them in as your own arguments. Example:

command (single node, 2 GPUs):

python -m torch.distributed.launch --nproc_per_node=2 
train_script.py --master_addr=localhost --nproc_per_node=2

train_script.py:

import argparse
parser = argparse.ArgumentParser()
# This is always passed in by default
parser.add_argument("--local_rank", type=int)
# These are your own arguments
parser.add_argument("--master_addr", type=str)
parser.add_argument("--nproc_per_node", type=int)
args = parser.parse_args()
print(args)

output:

Namespace(local_rank=0, master_addr='localhost', nproc_per_node=2)
Namespace(local_rank=1, master_addr='localhost', nproc_per_node=2)
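
As an alternative to duplicating the flags, torch.distributed.launch also exports some of them as environment variables for each worker process (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK), so a sketch like the following may work without extra arguments. nproc_per_node itself is not exported, but on a single node it equals WORLD_SIZE; the fallback values below are only illustrative assumptions:

import os

# torch.distributed.launch exports these for each worker process; the fallback
# values here are assumptions for illustration only.
master_addr = os.environ.get("MASTER_ADDR", "localhost")
master_port = int(os.environ.get("MASTER_PORT", 29500))
world_size = int(os.environ.get("WORLD_SIZE", 1))   # nproc_per_node * nnodes
rank = int(os.environ.get("RANK", 0))               # global rank of this process
print(master_addr, master_port, world_size, rank)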

We would be interested in hearing why you need nproc_per_node and master_addr in your training script; generally, just the rank is sufficient.
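
For reference, a minimal sketch of the usual rank-based pattern (the backend choice and defaults below are assumptions; adjust them to your setup):

import argparse
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # supplied by the launcher
args = parser.parse_args()

# The launcher exports MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE, so the env://
# init method needs no extra information here; "gloo" is an assumption, use
# "nccl" for multi-GPU training.
dist.init_process_group(backend="gloo", init_method="env://")
print(f"rank {dist.get_rank()} of {dist.get_world_size()}, local_rank {args.local_rank}")
dist.destroy_process_group()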

Hi Howard,
Thank you very much!
I want to implement some new asynchronous averaging algorithms for federated learning using PyTorch. But I am also new to the area, so my implementation may not be optimal.

I see! That makes sense, thank you. The launcher is not absolutely necessary, but it could be useful; here is the source code (it’s short) to glean some insight into what it is doing: pytorch/launch.py at master · pytorch/pytorch · GitHub

For your use case, I would recommend looking into the RPC framework (Distributed RPC Framework — PyTorch 1.8.1 documentation) if you haven’t already.
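
In case it is helpful, here is a minimal, hedged sketch of that RPC API using two local processes spawned with torch.multiprocessing (the worker names, port, and tensors are illustrative assumptions):

import os
import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

def run(rank, world_size):
    # The RPC framework also reads MASTER_ADDR/MASTER_PORT from the environment.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29501"
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    if rank == 0:
        # Ask worker1 to add two tensors and block until the result comes back.
        result = rpc.rpc_sync("worker1", torch.add,
                              args=(torch.ones(2), torch.ones(2)))
        print(result)  # tensor([2., 2.])
    rpc.shutdown()  # waits for all workers before tearing down

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)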

Yes, I have noticed this document. Thank you very much!
BTW, is the PyTorch team working on a general federated learning framework that supports flexible control over each client (processors, GPUs) and over how their gradients or model parameters are aggregated?

To my knowledge, there isn’t a project for a general federated learning framework. Feel free to start a new thread regarding this, as others may have insight; it will also be useful for feature tracking purposes.


Is there a way to invoke
-m torch.distributed.launch
programmatically? Otherwise, how can we run a debug session inside PyCharm?
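
One possible workaround, sketched here as an assumption rather than a confirmed recommendation: make the training script runnable without the launcher by providing the environment variables that torch.distributed.launch would normally set, so it can be started as a single process directly from a PyCharm run/debug configuration:

import os

# Fallbacks for a single-process debug run; the values are assumptions and only
# take effect when the script is started directly instead of via the launcher.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("LOCAL_RANK", "0")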