Recently I have wanted to train a network using 2 GPUs on 1 machine (node), but I am confused because I cannot find an example that concretely covers training, validation, saving checkpoints, and loading checkpoints.
Is there a good example that could help?
Here is the overview for the distributed training tools offered by PyTorch: https://pytorch.org/tutorials/beginner/dist_overview.html
If you are looking for data-parallel training, you might want to start with the DataParallel tutorial.
Thanks, I found the tutorial on DataParallel much easier to understand and implement.
torch.nn.parallel.DistributedDataParallel is much more difficult.
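For what it's worth, part of why DataParallel feels easier is that wrapping the model is a one-line change. A minimal sketch (the model architecture and tensor shapes here are just placeholders for illustration):

```python
import torch
import torch.nn as nn

# A toy model; the architecture is only a placeholder.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

# DataParallel splits each input batch across the visible GPUs,
# replicates the model on each, and gathers outputs on GPU 0.
# With no GPUs available it simply runs the wrapped module as-is.
if torch.cuda.is_available():
    model = model.cuda()
model = nn.DataParallel(model)  # uses all visible GPUs by default

batch = torch.randn(8, 10)
if torch.cuda.is_available():
    batch = batch.cuda()
out = model(batch)
print(out.shape)  # torch.Size([8, 2])
```

The rest of the training loop (loss, backward, optimizer step) stays exactly as in single-GPU code, which is the main appeal.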
DataParallel is indeed the easier entry point, but it is not the most efficient solution. If you are looking for faster training or plan to scale to more machines later, DDP is still the way to go.
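For comparison, here is a rough sketch of what a single-node DDP setup could look like (the model, port, and training loop are made-up placeholders; the backend falls back to gloo so the sketch also runs on a CPU-only machine):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # Each process handles one GPU; rank identifies the process.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"  # any free port
    # "nccl" is the usual backend for GPUs; "gloo" also runs on CPU.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    device = torch.device(f"cuda:{rank}") if torch.cuda.is_available() else torch.device("cpu")
    model = nn.Linear(10, 2).to(device)  # placeholder model
    ddp_model = DDP(model, device_ids=[rank] if torch.cuda.is_available() else None)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
    loss = None
    for _ in range(3):  # toy training loop with random data
        optimizer.zero_grad()
        loss = ddp_model(torch.randn(8, 10, device=device)).sum()
        loss.backward()  # DDP all-reduces gradients across ranks here
        optimizer.step()

    dist.destroy_process_group()
    return loss.item()

# On one machine with 2 GPUs, launch one worker per GPU with:
#   mp.spawn(worker, args=(2,), nprocs=2, join=True)
```

In a real script you would also wrap the dataset in a DistributedSampler so each rank sees a different shard of the data.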
Thanks, but I find the DDP example code on ImageNet is not easy to adapt to my own code. Is there a more detailed example?
Here are some more general DDP examples/tutorials:
There are also several example projects of varying complexity on GitHub that use DistributedDataParallel. They are a great reference for PyTorch code across a variety of domains.
Besides the link @osalpekar posted above, here is a summary of all DDP docs we have currently: https://pytorch.org/tutorials/beginner/dist_overview.html#torch-nn-parallel-distributeddataparallel
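Since the original question also asked about saving and loading checkpoints, here is one possible pattern (the helper names and checkpoint keys below are my own, not from the linked docs): save from rank 0 only, and unwrap the `.module` attribute so the same checkpoint loads into a plain, DataParallel, or DDP model.

```python
import torch
import torch.nn as nn

def save_checkpoint(model, optimizer, epoch, path, rank=0):
    # Save only from rank 0 so the processes don't race on the file.
    if rank != 0:
        return
    # DataParallel/DDP wrap the real model in a .module attribute;
    # unwrap it so the checkpoint also loads into a plain model.
    module = getattr(model, "module", model)
    torch.save({
        "epoch": epoch,
        "model_state": module.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path, device="cpu"):
    # map_location keeps rank-0 GPU tensors from landing on every
    # rank's GPU 0 when each process loads the same file.
    ckpt = torch.load(path, map_location=device)
    module = getattr(model, "module", model)
    module.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"]
```

When resuming DDP training, each rank typically loads the same checkpoint with its own `map_location` before (or right after) wrapping the model in DDP.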
Thanks a lot. I will give it a try.