Hi community,
How can we access the nproc_per_node parameter from inside the code?
for example, running the following:
python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE \
    --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
    --master_port=1234 YOUR_TRAINING_SCRIPT.py \
    (--arg1 --arg2 --arg3 and all other arguments of your training script)
How can YOUR_TRAINING_SCRIPT.py find out the values of nproc_per_node and master_addr?
import argparse
parser = argparse.ArgumentParser()
# The launcher always passes --local_rank to each worker by default
parser.add_argument("--local_rank", type=int)
# These are your own arguments; the launcher does not pass them
# automatically, so you must forward them on the command line yourself
parser.add_argument("--master_addr", type=str)
parser.add_argument("--nproc_per_node", type=int)
args = parser.parse_args()
print(args)
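As an alternative to forwarding these values as your own argparse flags, the launcher also exports the rendezvous settings as environment variables in each worker process (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK; LOCAL_RANK is exported only when --use_env is passed). A minimal sketch of reading them, assuming the script was started via torch.distributed.launch; the defaults shown are only illustrative fallbacks:

```python
import os

def get_dist_config():
    """Collect the distributed settings exported by the launcher."""
    return {
        "master_addr": os.environ.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": os.environ.get("MASTER_PORT", "29500"),
        # world_size = nnodes * nproc_per_node, so with a known node
        # count, nproc_per_node = world_size // nnodes.
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
        "rank": int(os.environ.get("RANK", "0")),
    }

if __name__ == "__main__":
    print(get_dist_config())
```

For example, with the two-node command above (nnodes=2), nproc_per_node can be recovered as world_size // 2 inside the script.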
Hi Howard,
Thank you very much!
I want to implement some new asynchronous averaging algorithms for federated learning using PyTorch. But I am also new to the area, so my implementation may not be optimal.
I see! That makes sense, thank you. The launcher is not absolutely necessary, but it can be useful. Here is the source code (it's short) to glean some insight into what it is doing: pytorch/launch.py at master · pytorch/pytorch · GitHub
Yes, I have noticed this document. Thank you very much!
BTW, is the PyTorch team working on a general federated learning framework that supports flexible control over each client (processors, GPUs) and over how their gradients or model parameters are aggregated?
To my knowledge, there isn't a project for a general federated learning framework. Feel free to start a new thread regarding this, as others may have insight; it will also be useful for feature-tracking purposes.