Hello. I've run into a weird situation with Ignite's evaluator.
I have a PyTorch DataLoader and a well-trained model.
When I extract outputs from the model manually, as below, everything works fine:
import torch

logfile = open('logs/manual.log', 'w')
dataloader_iter = iter(dataloader)
for i in range(500):
    x, t = next(dataloader_iter)
    y = model(x)  # raw logits straight from the model
    logfile.write('{0}th iteration...\n'.format(i))
    logfile.write(' x: {0}\n'.format(x))
    logfile.write(' O: {0}\n'.format(y))
    logfile.write(' o: {0}\n'.format(torch.argmax(y, dim=1)))  # predicted classes
    logfile.write(' t: {0}\n\n'.format(t))
logfile.close()
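(For context, model above is already wrapped in nn.DataParallel and lives on the GPU; that is also why the manual output below carries grad_fn=<GatherBackward>. A rough sketch of the setup, with a placeholder network standing in for my real one:)

import torch
import torch.nn as nn

# rough sketch of my setup; the tiny Sequential net and the input/class
# sizes are placeholders, my real model and checkpoint differ
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
# model.load_state_dict(torch.load('checkpoints/best.pth'))  # my real weights
model = nn.DataParallel(model).to('cuda')  # spreads each batch across the 8 GPUs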
The logfile then shows (focus on O, the direct output from the model):
1th iteration...
<I skipped other things...>
O: tensor([[-2.8684e+00, -2.4779e+00, 1.4409e+00, 4.2450e+00, 1.3716e+00,
3.1618e+00, 2.5039e+00, -1.3585e-03, -3.0291e+00, -3.1306e+00],
[ 3.1552e+00, 1.8194e+00, -3.0561e-01, -1.8409e+00, -2.6845e-01,
-3.2020e+00, -3.0243e+00, -7.1415e-01, 2.4644e+00, 1.6819e+00],
[ 8.4511e-01, -4.4308e-01, 1.1573e+00, 1.2453e-01, 5.5392e-01,
6.7758e-03, -1.0871e+00, 2.9498e-01, -7.7099e-01, -2.3612e-01],
[ 4.2768e+00, 1.0600e+00, 2.5328e-01, -8.9427e-01, -1.2167e+00,
-2.5544e+00, -4.6379e+00, -9.9491e-01, 4.3370e+00, 3.3613e-01],
[-1.5431e+00, -2.1083e+00, 2.6824e+00, 1.7168e+00, 2.7112e+00,
9.7829e-01, 2.7005e+00, 6.7293e-01, -3.5220e+00, -3.2395e+00],
[-1.7089e+00, -6.2691e-02, -6.5357e-01, 1.6353e+00, 3.9782e-01,
1.3109e+00, 3.1719e-01, 2.2915e-01, -2.3173e+00, 1.5951e+00],
[ 3.6005e-01, 6.5493e+00, -1.8465e+00, 1.6237e-01, -2.9798e+00,
-1.6791e+00, -3.1212e+00, -6.6102e-01, -1.4621e+00, 4.4547e+00],
[-4.2739e-02, -3.4736e+00, 2.4689e+00, 4.2563e-01, 2.9417e+00,
-7.3102e-01, 2.5900e+00, -1.5313e-01, -6.9163e-01, -1.9232e+00],
[ 5.4870e-01, -3.8533e+00, 1.4458e+00, 2.0452e+00, 1.9034e+00,
2.4421e+00, -1.2948e+00, 6.9678e-01, 3.8147e-01, -3.9030e+00],
[ 1.9710e+00, 2.5695e+00, 6.1807e-01, -3.4376e-01, 7.6892e-02,
-2.4107e+00, -8.0755e-01, -6.9068e-01, -6.7928e-01, 3.1438e-01],
[-2.5117e-01, -4.4562e+00, 2.7807e+00, 2.3242e+00, 3.1733e+00,
2.7852e+00, 5.5874e-01, -1.3494e-01, 1.5951e-01, -5.1482e+00],
[-7.8764e-01, 1.3296e+00, -1.2178e+00, 5.5745e-01, -1.5662e+00,
-5.0908e-01, -1.1492e+00, 4.3546e-01, -1.2073e+00, 3.8453e+00],
[-8.3729e-01, -2.7771e+00, 1.8575e+00, 2.3420e+00, 1.8659e+00,
2.1749e+00, 1.6091e+00, -4.0015e-01, -1.2970e+00, -3.6298e+00],
[-1.2611e+00, -2.7069e-02, 9.4248e-02, 8.1586e-01, 1.2401e+00,
1.4915e+00, -1.5474e+00, 2.9358e+00, -2.9514e+00, -7.6922e-01],
[ 2.9973e+00, 1.5974e+00, 3.1030e-01, -5.9278e-01, -1.3065e+00,
-3.6284e+00, -1.7095e+00, -2.5131e+00, 2.5173e+00, 2.9853e+00],
[ 2.8719e+00, 2.5325e-01, 1.6667e+00, -3.4530e-01, 1.8475e-01,
-2.7671e+00, 1.9819e+00, -3.5765e+00, 2.2073e+00, -1.5772e+00],
[-1.3318e+00, -1.1345e+00, 9.9565e-01, 3.0902e+00, -7.6009e-02,
3.6095e+00, -1.6780e+00, 9.3089e-01, -1.7266e+00, -2.3228e+00],
[-3.5282e-01, -2.2260e-01, -5.0407e-01, -8.8783e-02, 1.5501e+00,
-6.1404e-01, -1.8794e+00, 2.5478e+00, -2.1103e+00, 2.4213e+00],
[ 3.3746e+00, 1.5732e+00, -9.2370e-01, -7.0857e-01, -1.9817e+00,
-2.9782e+00, -3.5029e+00, -2.0559e+00, 3.9011e+00, 3.1880e+00],
[-2.7136e+00, -2.1364e+00, 2.3538e+00, 2.2541e+00, 3.3178e+00,
2.8204e+00, 2.5427e+00, 1.8494e+00, -4.9199e+00, -4.0796e+00]],
device='cuda:0', grad_fn=<GatherBackward>)
But when I use the evaluator, as below, I see something strange:
from ignite.engine import Events, create_supervised_evaluator

logfile2 = open('logs/evaluator.log', 'w')
evaluator = create_supervised_evaluator(model, device='cuda')

@evaluator.on(Events.ITERATION_COMPLETED)
def log_inference_output(engine):
    y_pred, y = engine.state.output  # default output_transform gives (y_pred, y)
    logfile2.write('{0}th iteration...\n'.format(engine.state.iteration))
    logfile2.write(' o: {0}\n'.format(y_pred))
    logfile2.write(' t: {0}\n\n'.format(y))

evaluator.run(dataloader)
logfile2.close()
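(With the default output_transform, create_supervised_evaluator stores the tuple (y_pred, y) in engine.state.output, which is why the handler unpacks two values.) The evaluator log then shows: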
1th iteration...
o: tensor([[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]], device='cuda:0')
t: tensor([3, 8, 8, 0, 6, 6, 1, 6, 3, 1, 0, 9, 5, 7, 9, 8, 5, 7, 8, 6],
device='cuda:0')
Why does this happen? I've been trying to figure it out for several days, but I can't find the answer.
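One difference I did notice: as far as I can tell from the ignite source, the evaluator built by create_supervised_evaluator calls model.eval() and runs under torch.no_grad() on each iteration, while my manual loop above left the model in training mode (hence the grad_fn in the first log). Since eval mode makes BatchNorm layers use their running statistics instead of batch statistics, a quick sanity check is to scan the module buffers for NaNs (a sketch, using only standard nn.Module APIs):

import torch

# BatchNorm's running_mean / running_var live in the module buffers and are
# only consulted in eval mode, so a NaN here would poison evaluator outputs
# while leaving the training-mode manual loop untouched
for name, buf in model.named_buffers():
    if buf.is_floating_point() and torch.isnan(buf).any():
        print('NaN in buffer:', name)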
My environment is:
- pytorch: 1.7.0
- ignite: 0.4.2
and I'm using 8 GPUs with a DataParallel model.
Any help would be appreciated. Thanks.