Evaluator returns NaN?

Wow, thanks for your detailed reply.

Now I can pinpoint the problematic part, but I still don't understand why this happens.

I tried two tests. First:

import torch

dataloader_iter = iter(dataloader)

def forward_and_log1(model, x, t, no_grad=False):
    # Log to a separate file depending on whether no_grad is active.
    logname = 'logs/no_grad_enabled.log' if no_grad else 'logs/no_grad_disabled.log'

    y = model(x)

    with open(logname, 'a') as logfile:
        logfile.write('    x: {0}\n'.format(x))
        logfile.write('    O: {0}\n'.format(y))
        logfile.write('    o: {0}\n'.format(torch.argmax(y, dim=1)))
        logfile.write('    t: {0}\n\n'.format(t))

for i in range(100):
    x, t = next(dataloader_iter)

    # Same forward pass, once with autograd enabled and once disabled.
    forward_and_log1(model, x, t, no_grad=False)
    with torch.no_grad():
        forward_and_log1(model, x, t, no_grad=True)

There were no NaN values in either of the two log files (no_grad_enabled.log, no_grad_disabled.log).
So the problem is not with torch.no_grad().
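
This result makes sense: torch.no_grad() only disables gradient tracking; it does not change how modules behave. A minimal sketch showing that a module stays in train mode inside a no_grad() block:

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)   # any module; BatchNorm is mode-sensitive
print(bn.training)       # True: modules default to train mode

with torch.no_grad():
    # no_grad() disables autograd, but the module mode is unchanged,
    # so layers like BatchNorm/Dropout still behave as in training.
    print(bn.training)   # still True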

Second:

dataloader_iter = iter(dataloader)

def forward_and_log2(model, x, t, is_train_mode=True):
    # Log to a separate file depending on the module mode.
    logname = 'logs/mode_train.log' if is_train_mode else 'logs/mode_eval.log'

    y = model(x)

    with open(logname, 'a') as logfile:
        logfile.write('    x: {0}\n'.format(x))
        logfile.write('    O: {0}\n'.format(y))
        logfile.write('    o: {0}\n'.format(torch.argmax(y, dim=1)))
        logfile.write('    t: {0}\n\n'.format(t))

for i in range(100):
    x, t = next(dataloader_iter)

    # Same forward pass, once in train mode and once in eval mode.
    # (Note: the mode must be toggled on the same model object that is
    # forwarded; my original snippet mistakenly called model_GPU.train().)
    model.train()
    forward_and_log2(model, x, t, is_train_mode=True)
    model.eval()
    forward_and_log2(model, x, t, is_train_mode=False)

Here I found that NaNs occurred in the eval-mode log file.
The problem is in eval mode!
But I'm still confused about why this happens, especially given your result (NaNs with DP but not with DDP). Maybe I found a bug in DP?
I found a similar topic describing my situation, but the solution there was not clear.
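
One thing that may be worth checking (an assumption on my side, since the model definition isn't shown here): model.eval() changes the behavior of layers like BatchNorm and Dropout. In eval mode, BatchNorm normalizes with its running statistics instead of per-batch statistics, so if running_mean or running_var ever became NaN or Inf during training, the outputs would be NaN only in eval mode, which matches this symptom. A minimal sketch to scan the model's buffers:

import torch

# Scan every registered buffer (e.g. BatchNorm running_mean/running_var)
# for non-finite values; `model` is assumed to be the trained network.
for name, buf in model.named_buffers():
    if buf.is_floating_point() and not torch.isfinite(buf).all():
        print('non-finite values in buffer:', name)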

Anyway, DDP solves the problem for me. Thank you.