How can we skip a step with NaN loss in the training_step when using Distributed Data Parallel (DDP) across multiple machines and multiple GPUs?
I guess you could apply `torch.nan_to_num(input, nan=0.0, posinf=None, neginf=None, *, out=None)` after each layer of the model so the values won't be NaN, but I'd suggest tracing back through the model and loss function to find where the NaN originates, since a NaN usually means there is a bug somewhere.
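A minimal sketch of both ideas, assuming a toy two-layer model (`TinyNet` is a hypothetical name used only for illustration): `nan_to_num` clamps values after each layer, and `torch.autograd.set_detect_anomaly(True)` will raise an error at the first backward op that produces NaN, which helps locate the bug.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.fc2 = nn.Linear(8, 1)

    def forward(self, x):
        # Replace any NaN/inf after each layer, as suggested above.
        x = torch.nan_to_num(self.fc1(x), nan=0.0)
        x = torch.nan_to_num(self.fc2(x), nan=0.0)
        return x

# Anomaly detection traces the backward pass and points at the
# op where a NaN gradient first appears (use only for debugging;
# it slows training down considerably).
with torch.autograd.set_detect_anomaly(True):
    model = TinyNet()
    out = model(torch.randn(2, 4))
    out.sum().backward()
```

Note that `nan_to_num` only masks the symptom; anomaly detection is what actually tells you which operation produced the NaN.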
Thank you for your reply. Can I set the gradient to 0 for parameters whose gradients contain NaN, before the optimizer step? NaN issues rarely occur, but once they do, they corrupt all subsequent training. That's why I want to skip the step that produces NaN.
Yes, but I'd suggest setting those gradients to `None` rather than zero; the optimizer then skips those parameters entirely (and their momentum buffers are not updated with a spurious zero step).
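A sketch of that check, run between `backward()` and `optimizer.step()`. This is a single-process illustration with placeholder `model`/`optimizer` objects, not your actual training loop; under DDP every rank must make the same skip/step decision, e.g. by all-reducing a flag with `torch.distributed.all_reduce`, otherwise ranks desynchronize.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(2, 4)
loss = model(x).sum() * float('nan')  # force a NaN loss for the demo
loss.backward()

# Check whether any gradient contains NaN/inf.
found_nan = any(
    p.grad is not None and not torch.isfinite(p.grad).all()
    for p in model.parameters()
)

if found_nan:
    # Under real DDP, all-reduce this flag (op=MAX) so every rank skips
    # together. set_to_none=True drops the gradients instead of writing
    # zeros, so the optimizer treats this as "no gradient this step".
    optimizer.zero_grad(set_to_none=True)
else:
    optimizer.step()
```

The `set_to_none=True` behavior is why `None` is preferable to zero here: a zero gradient still drives weight-decay and momentum updates in many optimizers, while a `None` gradient leaves the parameter untouched.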