I am currently training the model with DDP, but when I enter the validation loop within the same epoch I run it only on GPU 0, like this:
if dist.get_rank() == 0:
    print("Started Evaluating")
    model.eval()
I get the validation accuracy and other metrics without any error. But when I plotted the confusion matrix for each epoch, I noticed that the sum of instances along each row (the total number of ground-truth labels per class) differs across epochs.
Please let me know how to validate on a single GPU when the model is wrapped in DDP.
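A minimal sketch of rank-0-only validation, under the assumption that the differing row sums come from the validation loader sharding the data (e.g. a `DistributedSampler` left on the eval loader, which also pads with duplicated samples to even out shards). Here the eval loader is a plain, non-distributed `DataLoader`, and a `dist.barrier()` keeps the other ranks in lockstep. The tiny `Linear` model, random `TensorDataset`, and single-process `gloo` group are placeholders so the sketch runs as-is:

```python
import os
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset

# Scaffolding: a single-process "gloo" group so this sketch is runnable.
# In real DDP training, torchrun/init_process_group already did this.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)  # stands in for your DDP model's .module

# Key point: the validation loader must NOT use a DistributedSampler,
# otherwise rank 0 only sees its own shard (padded to even length),
# and the per-class row sums drift from epoch to epoch.
val_ds = TensorDataset(torch.randn(8, 4), torch.randint(0, 2, (8,)))
val_loader = DataLoader(val_ds, batch_size=4, shuffle=False)

if dist.get_rank() == 0:
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in val_loader:
            preds = model(x).argmax(dim=1)
            correct += (preds == y).sum().item()
            total += y.numel()
    print(f"val accuracy: {correct / total:.3f}")

dist.barrier()  # all ranks wait here so training resumes together
dist.destroy_process_group()
```

With this setup the row sums of the confusion matrix should equal the fixed per-class counts of the full validation set every epoch.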
I think it depends on how your eval data is set up. Is only rank 0 loading the eval data? Otherwise, you could still run eval data-parallel and make sure to aggregate the metrics across the data-parallel workers.
Yes, you're right. The validation loop only runs on rank 0: the eval data loads on a single GPU and validation runs there while the other GPUs wait. Right now it works fine with this setup, it's just that training plus validation takes a lot of time. That's why I was confused about whether this is the standard way to perform validation on a single GPU.