Is there any way to use multi-GPU during training?

The GPU memory is not enough to hold a big batch during training, and if I set a small batch size such as 12 or 14, it takes a long time to train the model. So is there any way to use multiple GPUs to speed up training?

You are probably looking for http://pytorch.org/docs/master/nn.html#dataparallel-layers-multi-gpu-distributed
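A minimal sketch of what that looks like (the model and sizes below are placeholders, not from this thread): DataParallel splits each input batch along the first dimension across the visible GPUs, so a larger batch is divided into per-GPU chunks.

```python
import torch
import torch.nn as nn

# Placeholder model, just for illustration.
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

if torch.cuda.device_count() > 1:
    # DataParallel replicates the model on each GPU, splits the input batch
    # along dim 0, runs the replicas in parallel, and gathers outputs on GPU 0.
    model = nn.DataParallel(model)

model = model.cuda()

# e.g. a batch of 48 split across 4 GPUs becomes 4 chunks of 12.
inputs = torch.randn(48, 128).cuda()
outputs = model(inputs)
```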

I tried to use torch.distributed.init_process_group(backend='gloo', init_method='env://', world_size=2, rank=0) to initialize it, but I do not know what "init_method" should be.
It raises an error, shown here:

File "train.py", line 64, in main
    torch.distributed.init_process_group(backend='gloo', init_method='env://', world_size=2, rank=0)
File "/usr/local/lib/python3.5/dist-packages/torch/distributed/__init__.py", line 49, in init_process_group
    group_name, rank)
RuntimeError: failed to read the MASTER_PORT environmental variable; maybe you forgot to set it? at /pytorch/torch/lib/THD/process_group/General.cpp:17

If you use the init method env://, you have to set a few environment variables. See the documentation here: http://pytorch.org/docs/0.3.0/distributed.html
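For example, something like this in each process (a sketch; the address, port, and world size are placeholders for your own setup):

```python
import os
import torch.distributed as dist

# These must be set before calling init_process_group with init_method='env://'.
os.environ['MASTER_ADDR'] = '127.0.0.1'  # address of the rank-0 machine
os.environ['MASTER_PORT'] = '23456'      # a free port on that machine

# Every one of the world_size processes calls this, each with its own rank.
dist.init_process_group(backend='gloo', init_method='env://',
                        world_size=2, rank=0)
```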

But if I set init_method='tcp://192.168.1.114:23456', where 192.168.1.114 is the IP of my PC, it hangs for a long time without any log. I do not know whether it is still alive or something has gone wrong.

Hi,
Could you explain the meaning of this init_method?
I'm confused about the 23456 part.

23456 is a network port on the machine that we will communicate with. Network ports range from 1 through 65535, and any port above 1024 can be used by applications on the operating system to communicate with each other (and across the network).
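If you want to check whether a given port is actually free before using it in init_method, here is a quick sketch using only the standard library (the helper name is hypothetical, not part of PyTorch):

```python
import socket

def port_is_free(port, host='127.0.0.1'):
    """Return True if nothing is currently listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) != 0

print(port_is_free(23456))  # True means 23456 can be used in init_method
```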


Thanks for your response.
I understand the 23456 now.
However, if I use DistributedDataParallel to utilize multiple GPUs on one server machine, should the init_method be 'tcp://server_ip:port'?

PS:
If I use DataParallel, a warning is raised when using an RNN or LSTM:

UserWarning: RNN module weights are not part of single contiguous chunk of memory.
This means they need to be compacted at every call, possibly greatly increasing memory usage.
To compact weights again call flatten_parameters().
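For reference, one common way to deal with that warning is to call flatten_parameters() at the start of the module's forward pass, so the RNN weights are re-compacted on each replica. A minimal sketch (the layer sizes are made up):

```python
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Hypothetical sizes, just for illustration.
        self.lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)

    def forward(self, x):
        # Re-compact the LSTM weights into one contiguous chunk of memory,
        # which is what the DataParallel warning asks for.
        self.lstm.flatten_parameters()
        output, _ = self.lstm(x)
        return output
```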

Yes, that would be the init method. Have a look at this example: https://github.com/pytorch/examples/tree/master/imagenet#multi-processing-distributed-data-parallel-training

It would be 'tcp://127.0.0.1:23456', for example. Generally, `127.0.0.1` means localhost.
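Putting it together, a rough sketch of single-machine DistributedDataParallel with a tcp:// init method (two GPUs assumed; you would launch one process per GPU, each calling this with its own rank):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

def setup(rank, world_size=2):
    # localhost works here because every process runs on the same server;
    # 23456 is just an arbitrary free port.
    dist.init_process_group(backend='gloo',  # nccl is usually preferred for GPUs
                            init_method='tcp://127.0.0.1:23456',
                            world_size=world_size, rank=rank)

    torch.cuda.set_device(rank)
    model = nn.Linear(128, 10).cuda(rank)  # placeholder model
    # Gradients are averaged across the world_size processes after each backward pass.
    ddp_model = DistributedDataParallel(model, device_ids=[rank])
    return ddp_model
```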