Per-class and per-sample weighting

That sounds right!
I’m not sure what the S samples are in your example, but here is a small dummy code snippet showing what I mean:

import torch
import torch.nn as nn

batch_size = 10
nb_classes = 2

model = nn.Linear(10, nb_classes)
# per-class weights are passed directly to the criterion
weight = torch.empty(nb_classes).uniform_(0, 1)
criterion = nn.CrossEntropyLoss(weight=weight, reduction='none')

# This would be returned from your DataLoader
x = torch.randn(batch_size, 10)
target = torch.empty(batch_size, dtype=torch.long).random_(nb_classes)
sample_weight = torch.empty(batch_size).uniform_(0, 1)

output = model(x)
loss = criterion(output, target)    # per-sample losses, shape [batch_size]
loss = loss * sample_weight         # apply the per-sample weights
loss.mean().backward()

Do you mean each batch has a different size, or what exactly are your samples?
Could you post a random tensor showing one sample batch?

EDIT: It’s probably also a good idea to normalize the sample weights so that the loss stays in approximately the same range and doesn’t depend on the current sample distribution in your batch.

loss = (loss * sample_weight / sample_weight.sum()).sum()

I’m not sure what range your weights are in, so maybe it’s not necessary.


Wow, thank you. The code helps me understand better what you were saying about how to actually implement the operation at the end.

Let’s see… I’m trying to think ahead to the final model, so bear with me. Also, I shouldn’t have used the word sample the way I did. Hopefully I explain it better below.

For each of the N samples (each individual recording), there are D divisions (the number of chunks the recording is divided into; since each recording is a different length, this number varies) and four 1D features (each feature is a different length and enters the graph at a different location, at least that’s the plan). So I see that as four input tensors for each sample/recording: (D, w), (D, x), (D, y), (D, z).

For each sample/recording, the output would be a 2D tensor of shape (D, C) after the softmax.

Then I would have a 1D class-weight tensor of length C and, for each sample, a 1D division(sample)-weight tensor of length D.
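
If I’m following your snippet correctly, for a single recording that would look roughly like this (the sizes and random weights below are just placeholders; I’m also passing raw logits to the criterion, since as far as I understand nn.CrossEntropyLoss applies log-softmax internally):

import torch
import torch.nn as nn

C = 3   # number of classes (placeholder)
D = 7   # divisions in this particular recording (placeholder)

logits = torch.randn(D, C, requires_grad=True)   # model output before softmax
target = torch.randint(0, C, (D,))               # one class index per division
class_weight = torch.rand(C)                     # 1D class-weight tensor, shape (C,)
division_weight = torch.rand(D)                  # 1D division-weight tensor, shape (D,)

criterion = nn.CrossEntropyLoss(weight=class_weight, reduction='none')
loss = criterion(logits, target)                 # per-division losses, shape (D,)
loss = (loss * division_weight / division_weight.sum()).sum()
loss.backward()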

Exactly! Would the 1D division(sample)-weight tensor also be returned from the DataLoader, or do you need to calculate it and load it from “somewhere else”?
As far as I understand, the divisions vary based on some criterion of your recording.
Would you want to process all divisions of a recording in a single batch or is a “windowed” approach plausible?
A sliding window approach can sometimes be a bit tricky, so don’t hesitate to ask here for some hints. :wink:
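
For example, a custom Dataset could simply return the precomputed weight as an additional item; just a rough sketch with made-up names and shapes:

import torch
from torch.utils.data import Dataset, DataLoader

class RecordingDataset(Dataset):
    # hypothetical dataset: features, targets, and precomputed division weights
    def __init__(self, features, targets, division_weights):
        self.features = features
        self.targets = targets
        self.division_weights = division_weights

    def __len__(self):
        return len(self.features)

    def __getitem__(self, index):
        # the weight is just returned as a third item of the sample
        return self.features[index], self.targets[index], self.division_weights[index]

# dummy data: 8 recordings, each already split into D=5 divisions of length 10
features = torch.randn(8, 5, 10)
targets = torch.randint(0, 2, (8, 5))
division_weights = torch.rand(8, 5)

loader = DataLoader(RecordingDataset(features, targets, division_weights), batch_size=2)
for x, y, w in loader:
    print(x.shape, y.shape, w.shape)  # [2, 5, 10], [2, 5], [2, 5]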

The division(sample)-weight will be calculated for every sample/recording ahead of time, so I guess it can also be returned from the DataLoader.

The division count varies just because each recording is a different length, with the largest being about 60% longer than the smallest. It’s just the nature of the data.

So, in my previous models using keras, I did use a windowed approach. However, because tensorflow requires the complete tensor to exist ahead of time, this was a massive waste of memory, since almost all of the data is duplicated when using a sliding window. That waste is what forced the windows to be smaller than I would have liked; since the model could only see a small window at a time, it could never grasp the long-range trends and cycles in the data. I’m exploring a combination of spatial pyramid pooling (SPP) and the TCN architecture (https://github.com/locuslab/TCN/tree/master/TCN) as a possible solution, since the 1st feature for each division is very long (and, in its unprocessed form, also varies in length because the sampling rate differs), and I would also like the network to train on the entire recording at once, so it can “see” the long-range cycles.
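
(If I understand correctly, in PyTorch that duplication can be avoided, since Tensor.unfold returns the overlapping windows as views of the original data rather than copies, e.g.:)

import torch

signal = torch.arange(12.)           # a dummy 1D recording
windows = signal.unfold(0, 4, 2)     # dimension=0, size=4, step=2 -> shape [5, 4], a view
print(windows.shape)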

Hopefully that all made sense.


Hi again @ptrblck
I have been thinking about this answer and I’m confused about a few situations where this solution could potentially be problematic.
Consider a situation where a few samples have a weight of 0. That is, intuitively, those samples (or observations or entries) are meaningless.

loss = loss * sample_weight could result in a loss of 0 for those particular samples. Hence, during gradient computation, wouldn’t the network technically be looking at a loss of 0 (perfect classification/regression) for samples we weren’t confident about to begin with? Do you think backprop on a loss of 0 could cause the network to crash?

Furthermore, wouldn’t sample_weight.requires_grad be true? Do you think it could create complications during backprop?

I’ve been doing something similar and my network almost always crashes mid training.

@Rakshit_Kothari, I cannot speak to whether this could cause a crash, or whether sample_weight should require grad. However, having trained several hundred networks using this method, I am confident that it worked for me.


Hi @apytorch, is it possible that some of your sample weights during training could have a value of 0?

@Rakshit_Kothari I’m positive that at least 10% of my weights were 0.
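
A quick sanity check (a toy snippet, not my actual model) also suggests that a zero weight simply removes that sample’s contribution to the gradients rather than breaking backward, and that the weight tensor doesn’t require grad unless you set it explicitly:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss(reduction='none')

x = torch.randn(4, 10)
target = torch.randint(0, 2, (4,))
sample_weight = torch.tensor([0., 0., 1., 1.])   # first two samples are "meaningless"

loss = criterion(model(x), target)
(loss * sample_weight).mean().backward()         # runs fine, no crash

# the zero-weighted samples contribute nothing to the gradients;
# sample_weight itself does not require grad unless explicitly set
print(sample_weight.requires_grad)  # False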


@ptrblck: How can I multiply a sample weight of size 4 with a loss of size 4x8x8? Here is an example:

import torch
import numpy as np
import torch.nn as nn

num_class = 2
b, h, w = 4, 8, 8
input = torch.randn((b, 1, h, w), requires_grad=True)
target = torch.empty((b, h, w), dtype=torch.long).random_(num_class)
pred = torch.rand((b, num_class, h, w), dtype=torch.float)
criterion = nn.CrossEntropyLoss(reduction='none')
loss = criterion(pred, target)
sample_weight = torch.empty(b).uniform_(0, 1)

print(sample_weight.shape, loss.shape)
loss = (loss * sample_weight / sample_weight.sum()).sum()
loss.mean().backward()

You could unsqueeze the additional dimension and use broadcasting for the multiplication:

sample_weight = sample_weight.view(-1, 1, 1)
loss = (loss * sample_weight / sample_weight.sum()).sum()
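
sample_weight will then have the shape [4, 1, 1] and broadcast against the [4, 8, 8] loss; a quick check with dummy tensors:

import torch

b, h, w = 4, 8, 8
loss = torch.rand(b, h, w)
sample_weight = torch.rand(b).view(-1, 1, 1)     # shape [4, 1, 1]
print((loss * sample_weight).shape)              # torch.Size([4, 8, 8])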

Thanks. I used the above code and it raised an error in backward():

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-22-88aafe38e0f1> in <module>()
     15 loss =(loss * sample_weight / sample_weight.sum()).sum()
     16 print (sample_weight.shape, loss.shape)
---> 17 loss.mean().backward()
     18 
     19 #loss_total = torch.mean(loss * weights)

1 frames
/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     91     Variable._execution_engine.run_backward(
     92         tensors, grad_tensors, retain_graph, create_graph,
---> 93         allow_unreachable=True)  # allow_unreachable flag
     94 
     95 

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Your code snippet doesn’t include a tensor with requires_grad=True in the loss calculation.
This would make the code work:

pred = torch.rand((b, num_class, h, w), dtype=torch.float, requires_grad=True)

OMG, it worked. But I think

loss = (loss * sample_weight / sample_weight.sum()).sum()
loss.mean().backward()

Should be

loss = torch.mean(loss * sample_weight)
loss.backward()

That is, removing the final .sum() and using torch.mean() instead. Am I correct?

To verify it, I set sample_weight to all ones; then it should be the same as reduction='mean':

import torch
import numpy as np
import torch.nn as nn

num_class = 2
b, h, w = 4, 8, 8
input = torch.randn((b, 1, h, w), requires_grad=True)
target = torch.empty((b, h, w), dtype=torch.long).random_(num_class)
pred = torch.rand((b, num_class, h, w), dtype=torch.float, requires_grad=True)
criterion = nn.CrossEntropyLoss(reduction='none')
loss = criterion(pred, target)

sample_weight = torch.from_numpy(np.asarray([1, 1, 1, 1])).float()
sample_weight = sample_weight.view(-1, 1, 1)
loss1 = torch.mean(loss * sample_weight)
print(loss1)

loss2 = (loss * sample_weight / sample_weight.sum()).sum()
loss2 = loss2.mean()
print(loss2)

criterion = nn.CrossEntropyLoss()
loss = criterion(pred, target)
print(loss)

Output

tensor(0.7202, grad_fn=<MeanBackward0>)
tensor(46.0946, grad_fn=<MeanBackward0>)
tensor(0.7202, grad_fn=<NllLoss2DBackward>)
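
Looking at the numbers, loss2 is off by exactly a factor of h * w = 64, since the final .sum() adds up all b * h * w elements while sample_weight.sum() only covers the b per-sample weights. If I instead normalize by the weight sum expanded over all elements (just my own attempt, not taken from the replies above), I get the reduction='mean' value back:

import torch
import torch.nn as nn

num_class = 2
b, h, w = 4, 8, 8
pred = torch.rand((b, num_class, h, w), requires_grad=True)
target = torch.randint(0, num_class, (b, h, w))

loss = nn.CrossEntropyLoss(reduction='none')(pred, target)
sample_weight = torch.ones(b).view(-1, 1, 1)

# normalize by the weight mass over all weighted elements; with all-ones
# weights this reduces exactly to reduction='mean'
loss3 = (loss * sample_weight).sum() / sample_weight.expand_as(loss).sum()
print(loss3, nn.CrossEntropyLoss()(pred, target))  # the two values match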

Does this ‘per-class per-sample weighting’ approach improve performance on an imbalanced dataset compared to static weighting?

No, the weight argument is optional.

It will be used to weight the class losses as given by the formula in the docs.

Yes, this should work, but it would of course not add the weighting this topic is about.

nb_classes=1 in nn.CrossEntropyLoss wouldn’t make sense since your model would only predict a single class and could thus never be wrong. nn.CrossEntropyLoss is used for multi-class classification/segmentation use cases.
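
As a quick toy check (not part of the original discussion): with a single class, the log-softmax over a single logit is always 0, so the loss is 0 regardless of the model output:

import torch
import torch.nn as nn

logits = torch.randn(4, 1)                       # a single output class
target = torch.zeros(4, dtype=torch.long)        # the only valid class index is 0
print(nn.CrossEntropyLoss()(logits, target))     # tensor(0.) -- the model can never be wrong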

How can the per sample weighting be implemented if you have a test set?
Would it be something like:
pred = model(X)
test_loss += loss_fn(pred, y).item()
test_loss += (test_loss * sample_weight / sample_weight.sum()).sum()

Can you just have criterion = nn.CrossEntropyLoss(reduction='none')?
Am I right in thinking this would apply just the per-sample weighting (and not the per-class weighting)?

And for per sample weighting, do the weights need to be appended to the input data / X?:
x = torch.cat((x, sample_weight), dim=1)

You shouldn’t be running the loss function on the test set. The test set should always be for evaluation only, never training.

Thanks, so I can keep it as:
test_loss += loss_fn(pred, y).item()
in the testing section?

Sorry, I made an assumption about code that you didn’t mention. Calculating the loss would be the same on the training/validation/testing sets.

(However, just make sure that you aren’t computing gradients or updating the weights during the test set evaluation.)
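
For reference, a typical evaluation loop would wrap the forward passes in torch.no_grad() and put the model into eval mode; a generic sketch with dummy data (the names below are made up):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# dummy model, criterion, and test data just for illustration
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss(reduction='none')
test_loader = DataLoader(TensorDataset(torch.randn(20, 10),
                                       torch.randint(0, 2, (20,)),
                                       torch.rand(20)),
                         batch_size=5)

model.eval()                 # e.g. disables dropout, uses running batchnorm stats
test_loss = 0.0
with torch.no_grad():        # no gradients are computed or stored
    for x, y, sample_weight in test_loader:
        pred = model(x)
        loss = criterion(pred, y)
        test_loss += (loss * sample_weight / sample_weight.sum()).sum().item()

# note: no backward() or optimizer.step() here, so the parameters are never updated
print(test_loss / len(test_loader))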