How to use 2 GPUs to train on a single machine (node)?

Hi, everyone!
Recently I have been trying to train a network using 2 GPUs on one machine (node), but I am confused because I cannot find an example that concretely covers training, validation, saving checkpoints, and loading checkpoints.
Is there a good example that could help?

Here is an overview of the distributed training tools offered by PyTorch: https://pytorch.org/tutorials/beginner/dist_overview.html

If you are looking for data-parallel training, you might want to start with DataParallel.
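
For reference, here is a minimal sketch of that pattern; `MyModel` is just a placeholder for your own network:

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    """Placeholder for your own network."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(10, 2)

    def forward(self, x):
        return self.net(x)

device = torch.device("cuda:0")
model = MyModel()
if torch.cuda.device_count() > 1:
    # DataParallel replicates the model on each visible GPU and
    # splits each input batch across them (your 2 GPUs here).
    model = nn.DataParallel(model)
model.to(device)

# Training, validation, and checkpointing then look like single-GPU code.
# To save weights without the DataParallel wrapper, use model.module:
# torch.save(model.module.state_dict(), "checkpoint.pt")
```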

Thanks, I found the tutorial on DataParallel much easier to understand and implement.
However, using torch.nn.parallel.DistributedDataParallel is much more difficult.

Yep, DataParallel is indeed an easier entry point, but it is not the most efficient solution. If you are looking for faster training or plan to scale to more machines later, DDP is still the way to go.
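
To give a sense of the difference: DDP runs one process per GPU rather than one process driving all GPUs. A minimal sketch of the per-process setup (the function name `run_worker` and the address/port values are just illustrative choices):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run_worker(rank, world_size):
    # Each process joins the same process group via a rendezvous address.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # Pin this process to its own GPU and wrap the model with DDP.
    torch.cuda.set_device(rank)
    model = torch.nn.Linear(10, 2).to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    # ... build the optimizer and run the training loop on ddp_model ...

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2  # the 2 GPUs on this machine
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size)
```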

Thanks. I find the DDP example code for ImageNet hard to adapt to my own code. Is there a more detailed example?

Here are some more general DDP examples/tutorials:

https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

There are also several example projects of varying complexity on GitHub that use DistributedDataParallel. They would be a great reference for PyTorch code across a variety of domains.
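
Since your original question also asked about saving and loading checkpoints: under DDP the usual pattern (shown in the tutorial above) is to save from rank 0 only and load on every rank. A hedged sketch, assuming a setup like the one earlier in this thread where each process knows its `rank` and holds a `ddp_model`:

```python
import torch
import torch.distributed as dist

def save_and_reload(ddp_model, rank, path="checkpoint.pt"):
    # DDP keeps parameters in sync across ranks, so saving from
    # rank 0 alone is sufficient; save the unwrapped module.
    if rank == 0:
        torch.save(ddp_model.module.state_dict(), path)

    # Wait until rank 0 has finished writing before any rank reads.
    dist.barrier()

    # Map tensors saved from GPU 0 onto this process's own GPU.
    map_location = {"cuda:0": f"cuda:{rank}"}
    state_dict = torch.load(path, map_location=map_location)
    ddp_model.module.load_state_dict(state_dict)
```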


Besides the link @osalpekar posted above, here is a summary of all the DDP docs we currently have: https://pytorch.org/tutorials/beginner/dist_overview.html#torch-nn-parallel-distributeddataparallel


Thanks a lot. I will give it a try.

Thanks. I will give it a try.