Validation when using DDP

Simple question, but I did not find any relevant information about it. I am currently training with DDP on 4 cards, and every 10 epochs I have a function that runs on the validation set, like this:

mae, mse = self.validate()

My question is, should I have it run only on the main GPU, like this?

if self.local_rank == 0:
    mae, mse = self.validate()

I feel it's a bit pointless to have all cards running validation, but I haven't been able to find any information that states otherwise.

Thank you for reading 🙂

It makes sense to have all cards run validation, as you can complete it sooner (which means you can start training again sooner).

Note that to get the correct result (combined from all cards) you would need to call torch.distributed.all_reduce().
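
As a concrete illustration, here is a minimal sketch of how the combined metrics could be computed. It assumes each rank validates a disjoint shard of the validation set (e.g. via DistributedSampler), and it uses a hypothetical helper self.validate_local() that returns local error *sums* and a sample count rather than means, so the reduction is exact:

import torch
import torch.distributed as dist

def validate_ddp(self):
    # Hypothetical helper: per-rank sum of absolute errors, sum of
    # squared errors, and number of validation samples on this rank.
    abs_err_sum, sq_err_sum, n = self.validate_local()

    # Pack the partial sums into one tensor so a single all_reduce suffices.
    # self.device is assumed to be this rank's GPU.
    stats = torch.tensor([abs_err_sum, sq_err_sum, n],
                         dtype=torch.float64, device=self.device)

    # Sum the partial statistics across all ranks; every rank gets the result.
    dist.all_reduce(stats, op=dist.ReduceOp.SUM)

    total_abs, total_sq, total_n = stats.tolist()
    return total_abs / total_n, total_sq / total_n  # mae, mse

Summing raw error totals rather than averaging per-rank means avoids bias when ranks end up with different numbers of samples.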


I have never used this method or seen it in any of the tutorials. Do I use it only for validation, or for training as well?

It’s going to be difficult for you to make progress if you don’t understand the tools you are using. I suggest understanding how DDP works at a high level.

You should not do it for training; that would be bad, as it would slow things down for very little benefit.

This is why I am here 🙂 Thank you for your input, have a lovely day!


To add an answer to your original question: you could do what you suggested,

if self.local_rank == 0:
    mae, mse = self.validate()

and that would work (and be the simplest solution)!

Your overall training would be slower because validation itself would be slower (only one card is doing the work while the others sit idle). But maybe that's fine for you.
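
For completeness, here is a minimal sketch of that rank-0-only variant, assuming the process group is already initialized. The dist.barrier() is my addition, not something from this thread: it keeps the other ranks parked until rank 0 finishes, so they don't run ahead into the next training step's collectives:

import torch.distributed as dist

if self.local_rank == 0:
    # Only the main process touches the validation set.
    mae, mse = self.validate()

# Assumed synchronization point: all other ranks wait here for rank 0.
# (With NCCL, a very long validation run could hit the collective timeout,
# so keep an eye on that.)
dist.barrier()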