Distributed evaluation with DDP


When scaling training from a single worker to multiple workers (say, multiple GPUs on the same machine), DDP provides abstractions so that I do not have to think about how to best implement synchronization between the workers.

For evaluation, however, it seems no such abstraction or best practice currently exists, and I have to resort to lower-level distributed calls to gather/reduce all my metrics into a single value. Is that correct, or does torch somehow provide a similar experience for evaluation?
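To make the question concrete, this is roughly the kind of lower-level code I mean: manually averaging a per-worker metric with `dist.all_reduce`. This is just a sketch of my current approach (the function name is my own); it assumes the process group has already been initialized.

```python
import torch
import torch.distributed as dist


def reduce_metric(total: float, count: int, device: torch.device) -> float:
    """Average a (sum, count) metric across all workers.

    Each rank passes its local running sum and sample count; the
    collective sums both over all ranks, then we divide.
    """
    t = torch.tensor([total, count], dtype=torch.float64, device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # element-wise sum over ranks
    return (t[0] / t[1]).item()
```

Every evaluation metric in my code needs a call like this, which is why I am asking whether a higher-level API exists.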

I’ve read “Ddp: evaluation, gather output, loss, and stuff. how to?” but wonder whether things have changed since then.

@ptrblck @suraj.pt It would be really helpful if you could provide some guidance on this. Thanks!

I would start with the PyTorch ImageNet example and use it as a template.

Thanks for the helpful link. Defining an AverageMeter class and calling reduce seems to be very common (I do it in my own code, based on multiple other repos that use this idea). Since this practice is common enough to be needed in many different places, including what should be the simplest code (ImageNet evaluation), are there any plans to incorporate it into PyTorch itself? Is there any reason to avoid that?
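For reference, the pattern I mean looks roughly like this: an AverageMeter that tracks a local sum/count and synchronizes across workers before reporting. This is my own sketch of the idiom, not anything PyTorch ships; the class and method names are illustrative, and the cross-worker sync assumes the process group is initialized (it is skipped otherwise).

```python
import torch
import torch.distributed as dist


class AverageMeter:
    """Track a running sum/count of a metric, with optional DDP sync."""

    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def update(self, value: float, n: int = 1):
        # Accumulate a batch-mean `value` weighted by batch size `n`.
        self.sum += value * n
        self.count += n

    @property
    def avg(self) -> float:
        return self.sum / max(self.count, 1)

    def all_reduce(self, device: str = "cpu"):
        # Sum (sum, count) over all ranks so .avg becomes globally correct.
        if dist.is_available() and dist.is_initialized():
            t = torch.tensor([self.sum, self.count],
                             dtype=torch.float64, device=device)
            dist.all_reduce(t, op=dist.ReduceOp.SUM)
            self.sum, self.count = t[0].item(), int(t[1].item())
```

In an evaluation loop, each rank calls `update()` per batch and then `all_reduce()` once at the end before reading `.avg`.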

thanks @ptrblck I will work on this.