Recently I have wanted to train a network using 2 GPUs on 1 machine (node), but I am confused because I cannot find an example that concretely covers training, validation, saving checkpoints, and loading checkpoints.
Is there a good example that could help?
Here is the overview for the distributed training tools offered by PyTorch: https://pytorch.org/tutorials/beginner/dist_overview.html
If you are looking for data-parallel training, you might want to start with the DataParallel tutorial.
Thanks, I found the tutorial on DataParallel much easier to understand and implement.
torch.nn.parallel.DistributedDataParallel is much more difficult.
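For what it's worth, part of why DataParallel feels easier is that wrapping the model is a one-line change. A minimal sketch (the model architecture and tensor shapes here are just placeholders for illustration):

```python
import torch
import torch.nn as nn

# A toy model; the architecture is only a placeholder.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

# DataParallel splits each input batch across the visible GPUs,
# replicates the model on each, and gathers outputs on GPU 0.
# With no GPUs available it simply runs the wrapped module as-is.
if torch.cuda.is_available():
    model = model.cuda()
model = nn.DataParallel(model)  # uses all visible GPUs by default

batch = torch.randn(8, 10)
if torch.cuda.is_available():
    batch = batch.cuda()
out = model(batch)
print(out.shape)  # torch.Size([8, 2])
```

The rest of the training loop (loss, backward, optimizer step) stays exactly as in single-GPU code, which is the main appeal.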
DataParallel is indeed the easier entry point, but it is not the most efficient solution. If you are looking for faster training or plan to scale to more machines later, DDP is still the way to go.
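For comparison, here is a rough sketch of what a single-node DDP setup could look like (the model, port, and training loop are made-up placeholders; the backend falls back to gloo so the sketch also runs on a CPU-only machine):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # Each process handles one GPU; rank identifies the process.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"  # any free port
    # "nccl" is the usual backend for GPUs; "gloo" also runs on CPU.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    device = torch.device(f"cuda:{rank}") if torch.cuda.is_available() else torch.device("cpu")
    model = nn.Linear(10, 2).to(device)  # placeholder model
    ddp_model = DDP(model, device_ids=[rank] if torch.cuda.is_available() else None)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
    loss = None
    for _ in range(3):  # toy training loop with random data
        optimizer.zero_grad()
        loss = ddp_model(torch.randn(8, 10, device=device)).sum()
        loss.backward()  # DDP all-reduces gradients across ranks here
        optimizer.step()

    dist.destroy_process_group()
    return loss.item()

# On one machine with 2 GPUs, launch one worker per GPU with:
#   mp.spawn(worker, args=(2,), nprocs=2, join=True)
```

In a real script you would also wrap the dataset in a DistributedSampler so each rank sees a different shard of the data.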
Thanks, but I find the DDP example code on ImageNet is not easy to adapt to my own code. Is there a more detailed example?
Here are some more general DDP examples/tutorials:
There are also several example projects of varying complexity on GitHub that use DistributedDataParallel. They are a great reference for PyTorch code across a variety of domains.
Besides the link @osalpekar posted above, here is a summary of all DDP docs we have currently: https://pytorch.org/tutorials/beginner/dist_overview.html#torch-nn-parallel-distributeddataparallel
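Since the original question also asked about saving and loading checkpoints, here is one possible pattern (the helper names and checkpoint keys below are my own, not from the linked docs): save from rank 0 only, and unwrap the `.module` attribute so the same checkpoint loads into a plain, DataParallel, or DDP model.

```python
import torch
import torch.nn as nn

def save_checkpoint(model, optimizer, epoch, path, rank=0):
    # Save only from rank 0 so the processes don't race on the file.
    if rank != 0:
        return
    # DataParallel/DDP wrap the real model in a .module attribute;
    # unwrap it so the checkpoint also loads into a plain model.
    module = getattr(model, "module", model)
    torch.save({
        "epoch": epoch,
        "model_state": module.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path, device="cpu"):
    # map_location keeps rank-0 GPU tensors from landing on every
    # rank's GPU 0 when each process loads the same file.
    ckpt = torch.load(path, map_location=device)
    module = getattr(model, "module", model)
    module.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"]
```

When resuming DDP training, each rank typically loads the same checkpoint with its own `map_location` before (or right after) wrapping the model in DDP.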
Thanks a lot. I will give it a try.