Wow, thanks for your thorough reply.
Now I can pinpoint which part the problem is in, but I still don't understand why it happens.
I ran two tests. First:
dataloader_iter = iter(dataloader)

def forward_and_log1(model, x, t, no_grad=False):
    # Log the input, raw output, prediction, and target to a separate
    # file depending on whether torch.no_grad() is active.
    if no_grad:
        logfile = open('logs/no_grad_enabled.log', 'a')
    else:
        logfile = open('logs/no_grad_disabled.log', 'a')
    y = model(x)
    logfile.write(' x: {0}\n'.format(x))
    logfile.write(' O: {0}\n'.format(y))
    logfile.write(' o: {0}\n'.format(torch.argmax(y, dim=1)))
    logfile.write(' t: {0}\n\n'.format(t))
    logfile.close()

for i in range(100):
    x, t = next(dataloader_iter)
    # Run the same batch with and without gradient tracking
    forward_and_log1(model, x, t, no_grad=False)
    with torch.no_grad():
        forward_and_log1(model, x, t, no_grad=True)
There were no problems (no NaN values) in either log file (no_grad_enabled.log, no_grad_disabled.log),
so the problem is not with torch.no_grad().
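Incidentally, rather than scanning the log files by hand, a check like the following flags NaNs directly (a minimal sketch; log_if_nan is a hypothetical helper, and y would be the model output from the snippets above):

import torch

def log_if_nan(name, tensor):
    # Print a warning if any element of the tensor is NaN
    if torch.isnan(tensor).any():
        print('NaN detected in {0}'.format(name))

# e.g. inside forward_and_log1, right after y = model(x):
# log_if_nan('output', y)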
Second:
dataloader_iter = iter(dataloader)

def forward_and_log2(model, x, t, is_train_mode=True):
    # Log the input, raw output, prediction, and target to a separate
    # file depending on whether the model is in train or eval mode.
    if is_train_mode:
        logfile = open('logs/mode_train.log', 'a')
    else:
        logfile = open('logs/mode_eval.log', 'a')
    y = model(x)
    logfile.write(' x: {0}\n'.format(x))
    logfile.write(' O: {0}\n'.format(y))
    logfile.write(' o: {0}\n'.format(torch.argmax(y, dim=1)))
    logfile.write(' t: {0}\n\n'.format(t))
    logfile.close()

for i in range(100):
    x, t = next(dataloader_iter)
    # Run the same batch in train mode and then in eval mode
    model.train()
    forward_and_log2(model, x, t, is_train_mode=True)
    model.eval()
    forward_and_log2(model, x, t, is_train_mode=False)
Here I found that NaNs occurred in the eval-mode log file.
The problem is with eval mode!
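If I understand correctly, the main behavioral difference between train() and eval() is in layers like BatchNorm and Dropout: in eval mode, BatchNorm normalizes with its accumulated running_mean/running_var buffers instead of the current batch statistics, so if those running statistics ever became NaN during training, the forward pass would produce NaNs only in eval mode. A quick check along these lines (a minimal sketch, assuming model is the network above with track_running_stats left at its default):

import torch
import torch.nn as nn

# Look for NaNs in the running statistics that eval mode relies on
for name, module in model.named_modules():
    if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
        if torch.isnan(module.running_mean).any() or torch.isnan(module.running_var).any():
            print('NaN in running stats of {0}'.format(name))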
But I'm still confused about why this happens, especially in light of your result (NaNs with DP but not with DDP). Maybe I found a bug in DP?
I found a similar topic describing my situation, but the solution there was not clear.
Anyway, DDP solves it for me. Thank you.
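For anyone else who hits this, the switch from DataParallel to DistributedDataParallel looked roughly like the following (a minimal single-node sketch, assuming a torchrun launch; my actual data loading and training loop are omitted):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

model = model.cuda(local_rank)               # move the model to this process's GPU
model = DDP(model, device_ids=[local_rank])  # wrap for synchronized gradients
# ...then run the usual training loop; each process typically gets its own
# shard of the data via torch.utils.data.distributed.DistributedSampler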