When scaling training from a single worker to multiple workers (say, multiple GPUs on the same machine), DDP provides abstractions so that I don't have to think about how to best implement gradient synchronization between the workers.
For evaluation, however, it seems that no comparable abstraction or best practice currently exists, and I have to resort to lower-level distributed calls (all_reduce/all_gather) to combine the per-worker metrics into a single value. Is that correct, or does torch provide a similar experience for evaluation?
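For concreteness, here is roughly the pattern I mean (a minimal sketch of my own code, not an official API; `evaluate`, the classification metrics, and the assumption that a process group is already initialized are all specific to my setup):

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

@torch.no_grad()
def evaluate(model, loader, device):
    # Accumulate raw sums locally on each rank.
    total_loss = torch.zeros(1, device=device)
    total_correct = torch.zeros(1, device=device)
    total_samples = torch.zeros(1, device=device)

    model.eval()
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        logits = model(inputs)
        total_loss += F.cross_entropy(logits, targets, reduction="sum")
        total_correct += (logits.argmax(dim=1) == targets).sum()
        total_samples += targets.size(0)

    # The lower-level calls in question: sum the counters over all ranks.
    for t in (total_loss, total_correct, total_samples):
        dist.all_reduce(t, op=dist.ReduceOp.SUM)

    return (total_loss / total_samples).item(), (total_correct / total_samples).item()
```

(I reduce raw sums rather than per-rank averages so the result stays exact even when ranks see different numbers of samples.)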
I’ve read Ddp: evaluation, gather output, loss, and stuff. how to? but wonder whether things have changed since then.