# What is a reasonable range for per-class weights?

Quick question:
I’m wondering whether there is a reasonable range for per-class weights (for dealing with unbalanced class representations). Or are they normalized somewhere later? I’m specifically talking about the `weight` argument of nn.CrossEntropyLoss().

If there are many classes (say 7000), calculating the weight per class as total_number_of_samples / number_of_samples_in_class results in big weights (say > 25000 for the rarest classes). Isn’t that a problem?

You can always normalize the weights to the 0-1 range. In the end, the weights only give a relative importance to each class, so their magnitude does not matter as long as that relative importance is preserved. If your weights are large, you can also reduce your learning rate so that your stochastic optimization algorithm does not oscillate.

Hope it helps.
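As a small sketch of the question about later normalization: the class counts, logits and targets below are made up for illustration. The weights are computed as total / count, as in the question. With the default `reduction="mean"`, nn.CrossEntropyLoss is documented to divide by the sum of the weights of the targets in the batch, so a common factor on the weights should cancel out; with `reduction="sum"` it does not.

```python
import torch
from torch import nn

# Hypothetical class counts for a 3-class problem (not from the thread)
counts = torch.tensor([900.0, 90.0, 10.0])

# Inverse-frequency weights as described in the question: total / count
weights = counts.sum() / counts
# Same relative importance, rescaled so the weights sum to 1
weights_norm = weights / weights.sum()

torch.manual_seed(0)
logits = torch.randn(8, 3)                         # made-up model outputs
targets = torch.tensor([0, 0, 1, 2, 0, 1, 2, 0])   # made-up labels

# reduction="mean" (default): weighted mean, the common factor cancels
mean_raw = nn.CrossEntropyLoss(weight=weights)(logits, targets)
mean_norm = nn.CrossEntropyLoss(weight=weights_norm)(logits, targets)

# reduction="sum": no normalization, the loss scales with the weights
sum_raw = nn.CrossEntropyLoss(weight=weights, reduction="sum")(logits, targets)
sum_norm = nn.CrossEntropyLoss(weight=weights_norm, reduction="sum")(logits, targets)

print(mean_raw.item(), mean_norm.item())
print(sum_raw.item(), sum_norm.item())
```

So the magnitude of the weights mostly matters when you use `reduction="sum"` or `reduction="none"`, or when you apply the weights to the per-sample losses yourself.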

But do the weights affect the optimization? As you’ve said, the optimizer only needs the relative importance. If the weights somehow affect the optimizer, wouldn’t changing the range to 0-1 cause very slow learning?

Yes, they can. If you take a simple problem and compute the derivatives with backpropagation, you will see that the gradients are scaled by any constant that multiplies your cost.

Now imagine your cost is either CE or 100*CE. In a convex optimization problem both have the same minimum for the parameters, because 100 is a constant factor. However, the gradients of 100*CE are 100 times larger, so a learning rate that works for CE can make your model oscillate with 100*CE.
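To make the point concrete without any neural network, here is a minimal sketch on a one-dimensional convex function. The function (w - 3)^2, the helper name `gradient_descent` and the step counts are all chosen just for this illustration.

```python
def gradient_descent(scale, lr, steps=50, w0=0.0):
    """Minimize scale * (w - 3)^2 with plain gradient descent; return all iterates."""
    w = w0
    path = [w]
    for _ in range(steps):
        grad = scale * 2.0 * (w - 3.0)  # derivative of scale * (w - 3)^2
        w = w - lr * grad
        path.append(w)
    return path

# Same minimum (w = 3) in all three cases; only the cost scale or lr changes
base = gradient_descent(scale=1.0, lr=0.1)           # converges towards 3
scaled = gradient_descent(scale=100.0, lr=0.1)       # overshoots and diverges
rescaled_lr = gradient_descent(scale=100.0, lr=0.001)  # lr divided by 100

print(base[-1], scaled[:4], rescaled_lr[-1])
```

Dividing the learning rate by the same factor that multiplies the cost gives back the original trajectory, which is why large weights and a smaller learning rate can compensate each other for plain gradient descent.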

Now, regarding different weights applied to the CE: suppose a ternary problem with (unbalanced) classes c_1, c_2 and c_3. In a normal setting the cost is computed as:

CE = \frac{1}{N} \sum_{i=1}^{N} CE(x_i, c_i)

where N is the total number of samples. Now suppose you want to apply different weights to the cross entropy, in order to give more importance to the under-represented class:

CE = \frac{1}{N} \sum_{i=1}^{N} w_{c_i} \cdot CE(x_i, c_i)

where w_{c_i} is the weight of the class that sample i belongs to. Ideally, the weights encode only a relative importance, so that the cost keeps roughly the same scale as it would have without the weights. For instance, given three samples, one from each class, you can compute any of:

CE = 1 \cdot CE(x_1, c_1) + 2 \cdot CE(x_2, c_2) + 3 \cdot CE(x_3, c_3)

CE = 2 \cdot CE(x_1, c_1) + 4 \cdot CE(x_2, c_2) + 6 \cdot CE(x_3, c_3)

CE = 100 \cdot CE(x_1, c_1) + 200 \cdot CE(x_2, c_2) + 300 \cdot CE(x_3, c_3)

In all these cases the relative importance given to each term is the same, but the gradients do not have the same scale: they grow with the weights. Thus, the best way is to use normalized weights, which in this case would be:

CE = \frac{1}{6} \cdot CE(x_1, c_1) + \frac{2}{6} \cdot CE(x_2, c_2) + \frac{3}{6} \cdot CE(x_3, c_3)
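The three-sample example can be checked numerically. This is only a sketch: the inputs `x`, the linear model `W` and the helper `weighted_grad` are invented for the illustration, and the weights are applied by hand to the per-sample losses so that the cost is the plain weighted sum written above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(3, 4)            # three made-up samples, one per class
y = torch.tensor([0, 1, 2])      # their classes c_1, c_2, c_3
W = torch.randn(4, 3, requires_grad=True)  # hypothetical linear model

def weighted_grad(class_weights):
    """Gradient w.r.t. W of sum_i w_{c_i} * CE(x_i, c_i)."""
    per_sample = F.cross_entropy(x @ W, y, reduction="none")
    loss = (class_weights[y] * per_sample).sum()
    (grad,) = torch.autograd.grad(loss, W)
    return grad

base = torch.tensor([1.0, 2.0, 3.0])
g1 = weighted_grad(base)                  # weights 1, 2, 3
g100 = weighted_grad(100.0 * base)        # weights 100, 200, 300
gn = weighted_grad(base / base.sum())     # weights 1/6, 2/6, 3/6

# Same direction in every case, very different step sizes
cos = F.cosine_similarity(g1.flatten(), g100.flatten(), dim=0)
print(cos.item(), g1.norm().item(), g100.norm().item(), gn.norm().item())
```

The direction of the gradient is the same for all weight sets; only its norm changes, which is what interacts with the learning rate.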

Hope it helps.

Wow, thank you for the detailed info, not only on how to do this but also on how it works! Although, shouldn’t it be:
CE = 2 \cdot CE(x_1, c_1) + 4 \cdot CE(x_2, c_2) + 6 \cdot CE(x_3, c_3) ?

You’re welcome. Here is some toy code if you want to try it and see that, if you scale the CE by 10000 and keep the same learning rate, the optimization oscillates.

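import torch
from torch import nn

# Two 1-D Gaussian clusters: class 0 centered at 0, class 1 centered at 2
x1 = torch.randn(100, 1)
x2 = torch.randn(100, 1) + 2
t1 = torch.zeros(100)
t2 = torch.ones(100)

sample = torch.cat((x1, x2))
targets = torch.cat((t1, t2))

# Logistic regression parameters; requires_grad so that backward() fills .grad
w1 = torch.randn(1, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

epochs = 100
lr = 0.1
scale = 1.0  # set to 10000.0 to scale the CE while keeping the same lr

for e in range(epochs):
    # squeeze to shape (200,) so predictions match the shape of targets
    predictions = torch.sigmoid(torch.mm(sample, w1) + b).squeeze(1)
    cost = scale * nn.functional.binary_cross_entropy(predictions, targets)
    cost.backward()
    # plain gradient descent update, assumed here to complete the loop
    with torch.no_grad():
        w1 -= lr * w1.grad
        b -= lr * b.grad
        w1.grad.zero_()
        b.grad.zero_()
    print(e, cost.item() / scale)  # unscaled CE, comparable across scales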