Different loss after several iterations with the same seed when using different reduction modes of nll_loss in a GCN

Hello, everyone:

I ran into a problem with the reduction mode of nll_loss, which gives me very different results in a GCN even with the same seed.

I am using a GCN for binary text classification. First, I pass a class weight to nll_loss as shown below, and it gives me a good result. Since I use the same seed, I get the same result every time I rerun the GCN, and I can verify in the log that the loss is identical at every iteration.

np.random.seed(args.seed)
torch.manual_seed(args.seed)
if args.cuda:
    torch.cuda.manual_seed(args.seed)

x_class = y.sum(dim=0).float()                      # per-class sample counts (y is one-hot)
z_weight = (1.0 / x_class) * x_class.sum(0) / 2.0   # inverse-frequency class weights
y = y.max(1)[1].type_as(labels)                     # one-hot -> class indices
return F.nll_loss(preds, y, z_weight)               # weighted 'mean' reduction (the default)

But when I switched nll_loss to reduction='none' and applied the weights manually, the returned loss is initially the same as above (0.7351 at the first iteration), yet after several iterations the results start to differ.

x_class = y.sum(dim=0).float()                      # per-class sample counts
z_weight = (1.0 / x_class) * x_class.sum(0) / 2.0   # inverse-frequency class weights
y = y.max(1)[1].type_as(labels)                     # one-hot -> class indices
loss_pre = F.nll_loss(preds, y, reduction='none')   # unweighted per-sample losses
final_weight = z_weight[y]                          # gather the class weight for each sample
return (loss_pre * final_weight).sum() / final_weight.sum()  # manual weighted mean
# original loss (weight passed to nll_loss, reduction='mean')
Epoch: 0001 | loss: 0.7351
Epoch: 0002 | loss: 0.6355
Epoch: 0003 | loss: 3.2794
...
Epoch: 0010 | loss: 0.6107
...
Epoch: 0200 | loss: 0.0207

# loss with reduction='none' and manual weighting
Epoch: 0001 | loss: 0.7351
Epoch: 0002 | loss: 0.6355
Epoch: 0003 | loss: 3.2797
...
Epoch: 0010 | loss: 0.6097
...
Epoch: 0200 | loss: 0.0239

Could anyone offer some help? Thanks a lot.
My environment is Python 3.6.9 and PyTorch 1.3.0 on CPU.

Are both approaches deterministic in isolation? I.e. are you getting the same results if you rerun the reduction='none' code?
If so, then I think the difference between both approaches could come from a different order of the reductions and thus accumulated small absolute errors due to the limited floating point precision.
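For illustration, here is a minimal sketch with random tensors (not your GCN) comparing the built-in weighted 'mean' reduction against your manual weighted mean. The two are mathematically identical, but the sums are accumulated in a different order, so they typically only agree up to float32 precision:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
preds = F.log_softmax(torch.randn(1000, 2), dim=1)   # fake log-probabilities
target = torch.randint(0, 2, (1000,))                # fake binary targets
weight = torch.tensor([0.3, 1.7])                    # fake class weights

# built-in weighted mean: sum(w[y_i] * l_i) / sum(w[y_i])
loss_builtin = F.nll_loss(preds, target, weight=weight, reduction='mean')

# manual weighted mean over the unreduced losses
loss_none = F.nll_loss(preds, target, reduction='none')
w = weight[target]
loss_manual = (loss_none * w).sum() / w.sum()

print(loss_builtin.item(), loss_manual.item())
print((loss_builtin - loss_manual).abs().item())     # tiny, but usually non-zero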

Thanks for your reply, @ptrblck.
Yes, the losses returned by the two reduction modes are the same at the first iteration, 0.7351.
When I rerun the reduction='none' version, it always shows 0.7351 at the first iteration.
I confirmed this with ipdb.
But after several iterations, the losses begin to differ.
So, how can I fix the calculation order of the reduction?
The reason I want to switch to reduction='none' is that I want to combine per-sample weights with the existing class-weighted loss.
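Roughly, what I am aiming for is something like this, building on my second snippet above; here sample_weight is a hypothetical per-sample weight tensor of shape [N], aligned with preds and y (I do not have this part yet):

loss_pre = F.nll_loss(preds, y, reduction='none')                  # per-sample NLL, shape [N]
combined_weight = z_weight[y] * sample_weight                      # class weight * per-sample weight
return (loss_pre * combined_weight).sum() / combined_weight.sum()  # weighted mean with both weights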

No, I meant to ask if the second approach executed multiple times is yielding the same results (not in comparison to the first approach).

Yes, I have updated my first reply. When I rerun the code with reduction='none' and the same seed, it outputs the same losses from the first iteration to the last.
In fact, judging from the results of reduction='none', it seems that the class weight information is not fully incorporated.

Try running this segment on CPU. The issue is that some operators used during backward are non-deterministic, so gradients differ between re-runs; in particular, indexing/gathering with a many-to-one element correspondence.
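For illustration (made-up tensors, not your code), this is the kind of many-to-one gather I mean: z_weight[y] picks one of two class weights for every sample, so its backward has to scatter-add many per-sample gradients into just two slots. That accumulation is non-deterministic on CUDA in older releases, though it should be deterministic on CPU:

import torch

weight = torch.tensor([0.3, 1.7], requires_grad=True)
y = torch.randint(0, 2, (1000,))
gathered = weight[y]           # many-to-one gather: 1000 values drawn from 2 entries
gathered.sum().backward()      # backward scatter-adds 1000 gradients into 2 slots
print(weight.grad)             # counts of how often each class weight was used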

Thanks for your reply, @googlebot.
In fact, I already run this code on CPU, and it still shows the results above.
Besides, I reran the reduction='none' version several times with the same seed, and it produced the same results and the same losses.
From my observation of the results returned by reduction='none', it seems that the class weight information is not incorporated.
So I wonder: is this a bug or some other problem?

Oh, then the answer above is likely correct:

You're either not looking beyond the first four digits of the loss, or the differences only manifest in the gradients.
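For example, to check beyond four digits (assuming loss is the tensor your loss function returns; just a formatting suggestion):

print(f"loss: {loss.item():.8f}")   # show 8 decimal places instead of 4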

Not sure how you’d “fix” that, and if you need to do adjustments to “mean” mode, you’d lose result identity anyway…

But I still think it is not related to float precision, because from the results of reduction='none' it seems that the class weight information is not being incorporated.
I still don't know how to calculate the loss correctly under reduction='none'…

Excuse me,

Is there a method to overcome this problem?
In my situation, reduction='mean' returns a good result, while reduction='none' returns a much worse one.
If I want to use reduction='none' to incorporate sample weights, I have to overcome the problem above.
I appreciate your help.