I have read some tutorials about DistributedDataParallel (DDP), but I could not find out how to correctly calculate the training loss and accuracy after one epoch of training.
With DataParallel, calculating loss and accuracy is easy because there is only one process. With DDP, however, every GPU runs its own process and trains on its own portion of the data. My questions are:
- How to evaluate the training accuracy correctly?
- I am following the example here: ImageNet Example
Does that code redundantly calculate the same test accuracy on every GPU? If so, is there any way to shard the test loader across processes, just like the train loader, and avoid the redundant computation?
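
To make the question concrete, here is a minimal sketch in plain Python (not real DDP code) of the arithmetic I think is involved. It assumes each process keeps a running loss sum, a correct-prediction count, and a sample count, and that these counters are summed across processes, which in a real run would presumably be a collective such as `torch.distributed.all_reduce`. The names `RankStats`, `shard_indices`, and `combine` are hypothetical and made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RankStats:
    """Counters one process would accumulate over its own shard (hypothetical)."""
    loss_sum: float  # sum of per-sample losses seen by this rank
    correct: int     # number of correct predictions on this rank
    count: int       # number of samples this rank processed


def shard_indices(num_samples, rank, world_size):
    """Strided split: each rank gets a disjoint subset, with no padding."""
    return list(range(rank, num_samples, world_size))


def combine(stats):
    """Stand-in for a SUM reduction over the three counters across ranks."""
    total_loss = sum(s.loss_sum for s in stats)
    total_correct = sum(s.correct for s in stats)
    total_count = sum(s.count for s in stats)
    # Divide only after summing, so ranks with more samples weigh more.
    return total_loss / total_count, total_correct / total_count


# Two ranks with unequal shard sizes (made-up numbers).
per_rank = [
    RankStats(loss_sum=6.0, correct=8, count=10),
    RankStats(loss_sum=2.0, correct=5, count=6),
]
mean_loss, accuracy = combine(per_rank)

# Naive alternative: average the per-rank accuracies directly.
naive_accuracy = sum(s.correct / s.count for s in per_rank) / len(per_rank)
```

With unequal shard sizes, `accuracy` and `naive_accuracy` differ, which is why I suspect the counts should be summed before dividing rather than averaging per-process accuracies. Note that `shard_indices` does not pad, so shards can differ in size by one sample.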