How weights are used in CrossEntropyLoss

I checked the docs and the explanation of how weights work in CrossEntropyLoss.
But when I check it for more than two samples, I get different results, as shown below.
For the following snippet:

import torch
import torch.nn as nn
import numpy as np

inp = torch.tensor([[0.9860, 0.1934],
        [0.9590, 0.3538],
        [0.1502, 0.9544],
        [0.7666, 0.0535],
        [0.1600, 0.3133],
        [0.1827, 0.8578],
        [0.2727, 0.7105],
        [0.3965, 0.0156]])

target = torch.tensor([1, 1, 1, 0, 0, 0, 1, 1])

cl_wts = 1. / torch.tensor([5., 3.])

loss = nn.CrossEntropyLoss()
loss_weighted = nn.CrossEntropyLoss(weight=cl_wts)

l1 = loss(inp, target)
print(l1)      # ---> tensor(0.7793)

l_wt = loss_weighted(inp, target)
print(l_wt)    # ---> tensor(0.7839)

When I check it manually as follows:

probs = torch.softmax(inp, dim=1)

which gives

probs = tensor([[0.6884, 0.3116],
        [0.6469, 0.3531],
        [0.3091, 0.6909],
        [0.6711, 0.3289],
        [0.4617, 0.5383],
        [0.3374, 0.6626],
        [0.3923, 0.6077],
        [0.5941, 0.4059]])

manual_loss = -(np.log(0.3116) + np.log(0.3531) + np.log(0.6909) + np.log(0.6711) + np.log(0.4617) + np.log(0.3374) + np.log(0.6077) + np.log(0.4059))
manual_loss = manual_loss / 8    # 8 is the mini-batch size
print(manual_loss)               # ---> 0.7793355874570308, which matches l1
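The same per-sample arithmetic can also be written in tensor form (this is just a restatement of the sum above, using probs and target from the snippet):

per_sample_nll = -torch.log(probs[torch.arange(len(target)), target])   # -log of the true-class probability
print(per_sample_nll.mean())    # ---> tensor(0.7793), matching l1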

However, for the weighted case:

man_loss_weighted = -(np.log(0.3116)*0.2 + np.log(0.3531)*0.2 + np.log(0.6909)*0.2 + np.log(0.6711)*0.33 + np.log(0.4617)*0.33 + np.log(0.3374)*0.33 + np.log(0.6077)*0.2 + np.log(0.4059)*0.2)/(0.2+0.33)
man_loss_weighted /= 8
print(man_loss_weighted)    # ---> 0.3633250361678566

which is not equal to the weighted loss l_wt.

How is it actually computed? Any help would be appreciated.
Thank you.

Hi Shakeel!

You have two errors in your computation of man_loss_weighted:

First, you need to divide by the sum of the weights that were actually
used for the individual samples.*

Second, you have mixed up your class-0 and class-1 weights.

Here is the correct manual computation:

>>> -(np.log(0.3116)*0.333333 + np.log(0.3531)*0.333333 + np.log(0.6909)*0.333333 + np.log(0.6711)*0.2 + np.log(0.4617)*0.2 + np.log(0.3374)*0.2 + np.log(0.6077)*0.333333 + np.log(0.4059)*0.333333)/(0.333333 + 0.333333 + 0.333333 + 0.2 + 0.2 + 0.2 + 0.333333 + 0.333333)
0.784032260475451

*) You divided first by cl_wts[0] + cl_wts[1], and then divided by 8
again. But you need to divide by the sum of the weights actually used
for each sample in the batch. Suppose your batch contained only
class-0 samples: in that case it wouldn’t make sense for cl_wts[1] to
enter the computation at all. Or suppose that all of your weights
were 1. You would first divide by the sum of those weights, which
would be 8, and then you would divide by 8 again, which would be
wrong.
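
For reference, you can check this in PyTorch itself, using the inp, target,
and cl_wts tensors from your snippet: with reduction = 'none' the per-sample
losses already carry their class weights, and the default 'mean' reduction
divides their sum by the sum of those per-sample weights.

per_sample = nn.CrossEntropyLoss(weight=cl_wts, reduction='none')(inp, target)
# each entry is cl_wts[target[n]] * (-log softmax(inp[n])[target[n]])
print(per_sample.sum() / cl_wts[target].sum())    # ---> tensor(0.7839), reproducing l_wt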

Best.

K. Frank

Thank you @KFrank,
but since the weight tensor is

cl_wts = 1./torch.tensor([5., 3.])    # = tensor([0.2000, 0.3333])

why are you multiplying the samples with label 1 by 0.333? Doesn’t this give that class more importance, even though it already occurs more often than the other class? Shouldn’t we multiply the samples of the minority class by the larger weight (in our case 0.333 should go to the minority class and 0.2 to the majority class)?

Could you please clarify this?

One more thing: when I pass the weights

tensor([0.2, 0.333])

does it work such that class 0 gets weight 0.2, class 1 gets 0.333, and so on?

Hi Shakeel!

I’m just using the weight tensor you specified.

Yes, weighting the minority class more heavily would be the typical
approach. Note, however, that I wouldn’t reweight a batch of size 8 with
the counts of the classes in that batch. I would typically weight my
classes based on the (approximate) class counts in my whole training
set (and I wouldn’t bother reweighting unless the classes were much
more imbalanced than 5-to-3).

Yes, that is correct: class 0 gets weight 0.2 and class 1 gets 0.333.
That is what I meant when I said that you had “mixed up your class-0
and class-1 weights.”
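
For example, one common convention for deriving such class weights from the
whole training set would be the following (train_targets here is just a
stand-in for your full tensor of training labels, not something from your
snippet):

counts = torch.bincount(train_targets)                    # number of samples per class
cl_wts = counts.sum() / (len(counts) * counts.float())    # inverse-frequency weights, normalized to average 1
criterion = nn.CrossEntropyLoss(weight=cl_wts)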

Best.

K. Frank

@KFrank
Thank you, this makes sense now.

@KFrank
How do the weights affect backpropagation, so that we can be sure the model is focusing on the minority class?
Do they boost the gradient, or do they increase the number of updates? Could you please clarify this?

Hi Shakeel!

I suggest that you try a quick test.

Don’t use a model. Just create pred with requires_grad = True:

pred = torch.randn (10, 2, requires_grad = True)

Then create some target class labels:

targ = torch.randint (2, (10, ))

Then calculate and backpropagate CrossEntropyLoss with
various choices for weight and see what happens:

torch.nn.CrossEntropyLoss (weight = my_class_weights) (pred, targ).backward()
print (pred.grad)
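
For example, a self-contained version of this experiment (the seed and the
particular weight values are just for illustration) could look like:

import torch

torch.manual_seed(0)
pred = torch.randn(10, 2, requires_grad=True)
targ = torch.randint(2, (10,))

# unweighted loss
torch.nn.CrossEntropyLoss()(pred, targ).backward()
grad_unweighted = pred.grad.clone()
pred.grad = None

# weighted loss -- give class 1 five times the weight of class 0
torch.nn.CrossEntropyLoss(weight=torch.tensor([1.0, 5.0]))(pred, targ).backward()
grad_weighted = pred.grad.clone()

# Each row's gradient is rescaled by weight[targ[n]] divided by the mean of the
# weights used in the batch, so rows whose target is class 1 get proportionally
# larger gradients -- the weights rescale the gradient contributions, they don't
# add extra updates.
print(grad_weighted / grad_unweighted)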

Best.

K. Frank