Is there any way to use multi-GPU during training?

The GPU memory is not enough to hold a big batch during training, and if I set a small batch size such as 12 or 14, it takes a long time to train the model. So is there any way to use multiple GPUs to speed up training?

You are probably looking for http://pytorch.org/docs/master/nn.html#dataparallel-layers-multi-gpu-distributed
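A minimal sketch of what that looks like (the model and sizes below are placeholders, not from this thread): DataParallel splits each input batch along the first dimension across the visible GPUs, so a larger batch is divided into per-GPU chunks.

```python
import torch
import torch.nn as nn

# Placeholder model, just for illustration.
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

if torch.cuda.device_count() > 1:
    # DataParallel replicates the model on each GPU, splits the input batch
    # along dim 0, runs the replicas in parallel, and gathers outputs on GPU 0.
    model = nn.DataParallel(model)

model = model.cuda()

# e.g. a batch of 48 split across 4 GPUs becomes 4 chunks of 12.
inputs = torch.randn(48, 128).cuda()
outputs = model(inputs)
```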

I tried to use torch.distributed.init_process_group(backend='gloo', init_method='env://', world_size=2, rank=0) to initialize it, but I do not know what "init_method" should be.
It raises an error, shown here:

File "train.py", line 64, in main
    torch.distributed.init_process_group(backend='gloo', init_method='env://', world_size=2, rank=0)
File "/usr/local/lib/python3.5/dist-packages/torch/distributed/__init__.py", line 49, in init_process_group
    group_name, rank)
RuntimeError: failed to read the MASTER_PORT environmental variable; maybe you forgot to set it? at /pytorch/torch/lib/THD/process_group/General.cpp:17

If you use the init method env://, you have to set a few environment variables. See the documentation here: http://pytorch.org/docs/0.3.0/distributed.html
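For example, something like this in each process (a sketch; the address, port, and world size are placeholders for your own setup):

```python
import os
import torch.distributed as dist

# These must be set before calling init_process_group with init_method='env://'.
os.environ['MASTER_ADDR'] = '127.0.0.1'  # address of the rank-0 machine
os.environ['MASTER_PORT'] = '23456'      # a free port on that machine

# Every one of the world_size processes calls this, each with its own rank.
dist.init_process_group(backend='gloo', init_method='env://',
                        world_size=2, rank=0)
```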

But if I set init_method='tcp://192.168.1.114:23456', where 192.168.1.114 is the IP of my PC, it hangs for a long time without any log. I do not know whether it is still alive or something has gone wrong.

Hi,
Could you explain the meaning of this init_method?
I'm confused about the 23456 part.

23456 is a network port on the machine that we will communicate with. Network ports range from 1 through 65535, and any port above 1024 can be used by applications on the operating system to communicate with each other (and across the network).
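If you want to check whether a given port is actually free before using it in init_method, here is a quick sketch using only the standard library (the helper name is hypothetical, not part of PyTorch):

```python
import socket

def port_is_free(port, host='127.0.0.1'):
    """Return True if nothing is currently listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) != 0

print(port_is_free(23456))  # True means 23456 can be used in init_method
```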


Thanks for your response.
I understand the 23456 now.
However, if I use DistributedDataParallel to utilize multiple GPUs on one server machine, should the init_method be 'tcp://server_ip:port'?

PS:
If I use DataParallel, a warning is raised when using an RNN or LSTM:

UserWarning: RNN module weights are not part of single contiguous chunk of memory.
This means they need to be compacted at every call, possibly greatly increasing memory usage.
To compact weights again call flatten_parameters().
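For reference, one common way to deal with that warning is to call flatten_parameters() at the start of the module's forward pass, so the RNN weights are re-compacted on each replica. A minimal sketch (the layer sizes are made up):

```python
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Hypothetical sizes, just for illustration.
        self.lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)

    def forward(self, x):
        # Re-compact the LSTM weights into one contiguous chunk of memory,
        # which is what the DataParallel warning asks for.
        self.lstm.flatten_parameters()
        output, _ = self.lstm(x)
        return output
```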

Yes, that would be the init method. Have a look at this example: https://github.com/pytorch/examples/tree/master/imagenet#multi-processing-distributed-data-parallel-training

It would be 'tcp://127.0.0.1:23456', for example. Generally, `127.0.0.1` means localhost.
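Putting it together, a rough sketch of single-machine DistributedDataParallel with a tcp:// init method (two GPUs assumed; you would launch one process per GPU, each calling this with its own rank):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

def setup(rank, world_size=2):
    # localhost works here because every process runs on the same server;
    # 23456 is just an arbitrary free port.
    dist.init_process_group(backend='gloo',  # nccl is usually preferred for GPUs
                            init_method='tcp://127.0.0.1:23456',
                            world_size=world_size, rank=rank)

    torch.cuda.set_device(rank)
    model = nn.Linear(128, 10).cuda(rank)  # placeholder model
    # Gradients are averaged across the world_size processes after each backward pass.
    ddp_model = DistributedDataParallel(model, device_ids=[rank])
    return ddp_model
```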