Hi developers,
I am trying to train a network on two machines, each with 8 GPUs. I compiled PyTorch from source with NCCL2. I export the master address and port in the environment like this:
export MASTER_ADDR=$(host `hostname` | awk '{print $(NF)}')
export MASTER_PORT=8080
export WORLD_SIZE=2
export RANK=$OMPI_COMM_WORLD_RANK
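Before launching, I added a small sanity check (a hypothetical helper of my own; it only uses the variable names from the exports above) to confirm each process sees consistent rendezvous settings:

```python
import os

def check_dist_env():
    # Verify the rendezvous variables exported above are present and
    # consistent before calling init_process_group; a missing or
    # malformed value is a common cause of a silent hang.
    required = ["MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK"]
    missing = [name for name in required if name not in os.environ]
    if missing:
        raise RuntimeError("missing env vars: {}".format(missing))
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    if not 0 <= rank < world_size:
        raise RuntimeError("rank {} out of range for world size {}"
                           .format(rank, world_size))
    return rank, world_size
```

Both machines pass this check, so the variables themselves seem fine.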
Then I initialize the process group in the training script:
import torch.distributed as dist
import os

# init (one process per machine; rank and world_size must be ints)
dist.init_process_group(backend='gloo',
                        rank=int(os.environ['OMPI_COMM_WORLD_RANK']),
                        world_size=int(os.environ['WORLD_SIZE']))
# main (cast the rank to int here too, since it is later used as a device index)
main(rank=int(os.environ['OMPI_COMM_WORLD_RANK']))
The main function is:
def main(rank):
    ...  # some code
    model = torch.nn.parallel.DistributedDataParallel(
        model.cuda(rank), device_ids=[0, 1, 2, 3, 4, 5, 6, 7])
    ...  # some code
    for i, (input, target) in enumerate(train_loader):
        input_var = torch.autograd.Variable(input.cuda(rank))
        target_var = torch.autograd.Variable(target.cuda(rank))
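As a side note, if I later move to one process per GPU (WORLD_SIZE=16 instead of 2), I assume each process's local CUDA device would have to be derived from the global rank rather than used directly; a hypothetical helper for that mapping:

```python
def local_gpu(global_rank, gpus_per_node=8):
    # With one process per GPU across two 8-GPU machines, global
    # ranks 0-7 live on machine 1 and ranks 8-15 on machine 2, so
    # the local CUDA device index is the rank modulo GPUs per node.
    return global_rank % gpus_per_node
```

For now, though, I run only one process per machine, as shown above.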
When I run the code, it gets stuck with no errors. Running it under gdb --args python train.py gives this output:
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-56.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from ./distributed-pytorch/distributed-pytorch-0.4.0/anaconda2/bin/python...done.
(gdb) r
Starting program: ./distributed-pytorch/distributed-pytorch-0.4.0/anaconda2/bin/python train.py
[Thread debugging using libthread_db enabled]
Detaching after fork from child process 32217.
Detaching after fork from child process 32607.
Detaching after fork from child process 32609.
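To rule out networking, I also ran a small connectivity probe (a hypothetical helper of mine, standard library only) from the second machine against MASTER_ADDR:MASTER_PORT:

```python
import socket

def can_reach_master(addr, port, timeout=3.0):
    # init_process_group blocks until every rank reaches the master's
    # TCP endpoint, so a hang can simply mean this connection fails
    # (firewall, wrong address, or a port that is already in use).
    try:
        conn = socket.create_connection((addr, int(port)), timeout=timeout)
        conn.close()
        return True
    except OSError:
        return False
```

The probe connects successfully from both machines, so basic reachability does not seem to be the problem.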
Is there something wrong with my code? Thanks.