How can we skip a step with NaN loss in the training_step when using Distributed Data Parallel (DDP) across multiple machines and multiple GPUs?
I guess you could apply `torch.nan_to_num(input, nan=0.0, posinf=None, neginf=None, *, out=None)` after each layer of the model so the values won't be NaN, but I'd suggest tracing back through the model and loss function to find where the NaN originates, since a NaN usually means there is a bug somewhere.
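A minimal sketch of both ideas, assuming a toy two-layer model (`TinyNet` is a hypothetical name used only for illustration): `nan_to_num` clamps values after each layer, and `torch.autograd.set_detect_anomaly(True)` will raise an error at the first backward op that produces NaN, which helps locate the bug.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.fc2 = nn.Linear(8, 1)

    def forward(self, x):
        # Replace any NaN/inf after each layer, as suggested above.
        x = torch.nan_to_num(self.fc1(x), nan=0.0)
        x = torch.nan_to_num(self.fc2(x), nan=0.0)
        return x

# Anomaly detection traces the backward pass and points at the
# op where a NaN gradient first appears (use only for debugging;
# it slows training down considerably).
with torch.autograd.set_detect_anomaly(True):
    model = TinyNet()
    out = model(torch.randn(2, 4))
    out.sum().backward()
```

Note that `nan_to_num` only masks the symptom; anomaly detection is what actually tells you which operation produced the NaN.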
Thank you for your reply. Can I set the gradient to 0 for parameters whose gradients contain NaN, before the optimizer step? NaN issues rarely occur, but once they do, they corrupt all subsequent training. That's why I want to skip the step that produces NaN.
Yes, but I'd suggest setting those gradients to `None` rather than zero; the optimizer then skips those parameters entirely (and their momentum buffers are not updated with a spurious zero step).
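A sketch of that check, run between `backward()` and `optimizer.step()`. This is a single-process illustration with placeholder `model`/`optimizer` objects, not your actual training loop; under DDP every rank must make the same skip/step decision, e.g. by all-reducing a flag with `torch.distributed.all_reduce`, otherwise ranks desynchronize.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(2, 4)
loss = model(x).sum() * float('nan')  # force a NaN loss for the demo
loss.backward()

# Check whether any gradient contains NaN/inf.
found_nan = any(
    p.grad is not None and not torch.isfinite(p.grad).all()
    for p in model.parameters()
)

if found_nan:
    # Under real DDP, all-reduce this flag (op=MAX) so every rank skips
    # together. set_to_none=True drops the gradients instead of writing
    # zeros, so the optimizer treats this as "no gradient this step".
    optimizer.zero_grad(set_to_none=True)
else:
    optimizer.step()
```

The `set_to_none=True` behavior is why `None` is preferable to zero here: a zero gradient still drives weight-decay and momentum updates in many optimizers, while a `None` gradient leaves the parameter untouched.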