How could one do both per-class weighting (probably CrossEntropyLoss) -and- per-sample weighting while training in pytorch?
The use case is classification of individual sections of time series data (think thousands of sections per recording). The classes are very imbalanced, but given the continuous nature of the signal, I cannot over- or under-sample. And the sections cannot be analyzed in isolation, as information from surrounding sections is necessary to classify each section.
The other problem is that individual sections of the time series will sometimes be junk (think pure noise, or no signal - which I can easily quantify during pre-processing). The network will still try to classify such a section, but I want to give it a weight of zero, so that no error is propagated for the network being unable to classify an unclassifiable section.
For the class weighting I would indeed use the weight argument in the loss function, e.g. CrossEntropyLoss.
I assume you could save a tensor with the sample weight during your preprocessing step.
If so, you could create your loss function using reduction='none', which would return the loss for each sample. Using this you could return your sample weights with your data and target for the current batch, multiply it with the loss, and finally calculate the average before calling backward().
Ah, that sounds right. Let me repeat this back to make sure I’m on the same page.
I'd have the network output a 3D tensor of (R recordings, C classes, S samples). CrossEntropyLoss, with reduction='none' and a class weight tensor of C classes, would return a 2D tensor of losses of shape (R recordings, S samples). Then I would multiply each R by the unique sample_weight 1D tensor for that R, and finally average this before calling backward().
Does that sound correct? Btw, each recording has a different number of samples. Which, if I understand the benefits of the dynamic graph in pytorch, shouldn’t matter.
I’m still wrapping my head around moving from keras to pytorch. Thank you.
That sounds right!
I'm not sure what the S samples are in your example, but here is a small dummy code snippet showing what I mean:
import torch
import torch.nn as nn

batch_size = 10
nb_classes = 2

model = nn.Linear(10, nb_classes)

# per-class weights for the criterion
weight = torch.empty(nb_classes).uniform_(0, 1)
criterion = nn.CrossEntropyLoss(weight=weight, reduction='none')

# This would be returned from your DataLoader
x = torch.randn(batch_size, 10)
target = torch.empty(batch_size, dtype=torch.long).random_(nb_classes)
sample_weight = torch.empty(batch_size).uniform_(0, 1)

output = model(x)
loss = criterion(output, target)  # shape: [batch_size]
loss = (loss * sample_weight).mean()  # weight each sample, then average
loss.backward()
Do you mean each batch has a different size or what exactly are your samples?
Could you post a random tensor showing one sample batch?
EDIT: It's probably also a good idea to normalize the sample weights, so that the scale of the loss stays approximately the same and doesn't depend on the sample weight distribution in the current batch:
loss = (loss * sample_weight / sample_weight.sum()).sum()
I’m not sure in what range your weights are, so maybe it’s not necessary.
Wow, thank you. The code helps me understand better what you were saying about how to actually implement the operation at the end.
Let’s see… I’m trying to think ahead to the final model, so bear with me. Also, I shouldn’t have used the word sample the way I did. Hopefully I explain it better below.
For each of the N samples (each an individual recording), there are D divisions (the number of chunks the recording is divided into; since each recording is a different length, D varies), and four 1D features (each feature is a different length and enters the graph at a different location - at least, that's the plan). So I see that as 4 input tensors for each sample/recording: (D,w), (D,x), (D,y), (D,z).
For each sample/recording, the output would be a 2D tensor after the softmax of (D, C).
Then, I would have a 1D class-weight tensor of 1xC classes, and for each sample, a 1D division(sample)-weight tensor of 1xD divisions.
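If I follow, the per-recording loss could then be sketched like this (the C, D, and weight values below are made up for illustration; note that nn.CrossEntropyLoss expects raw logits and applies log-softmax internally, so the softmax itself would stay out of the loss path):

```python
import torch
import torch.nn as nn

nb_classes = 5   # C (assumed value)
D = 800          # divisions in one recording; varies per recording

class_weight = torch.rand(nb_classes)            # 1D per-class weights
criterion = nn.CrossEntropyLoss(weight=class_weight, reduction='none')

# hypothetical model output for one recording: one row of logits per division
# (nn.CrossEntropyLoss wants logits, not softmax probabilities)
logits = torch.randn(D, nb_classes, requires_grad=True)
target = torch.randint(0, nb_classes, (D,))
division_weight = torch.rand(D)                  # precomputed; 0 for junk divisions

loss = criterion(logits, target)                 # per-division losses, shape: [D]
loss = (loss * division_weight).sum() / division_weight.sum()
loss.backward()
```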
Exactly! Would the 1D division(sample)-weight tensor also be returned from the DataLoader, or do you need to calculate and load it from "somewhere else"?
As far as I understand the divisions vary based on some criteria of your recording.
Would you want to process all divisions of a recording in a single batch or is a “windowed” approach plausible?
A sliding window approach can sometimes be a bit tricky so don’t hesitate to ask here for some hints.
The division(sample)-weight will be calculated for every sample/recording ahead of time, so I guess it can also be returned with DataLoader.
The division count varies just because each recording is a different length, with the largest being about 60% longer than the smallest. It’s just the nature of the data.
So, in my previous models using Keras, I did use a windowed approach. However, because TensorFlow requires the complete tensor to exist ahead of time, this was a massive waste of memory, since almost all of the data is duplicated by a sliding window. That waste is what led to keeping the windows smaller than I would have liked; since the model could only see a small window at a time, it could never grasp the long-range trends and cycles in the data.
I'm exploring a combination of spatial pyramid pooling (SPP) and the TCN architecture as a possible solution, since the 1st feature for each division is very long (and in its unprocessed form also varies in length, as the sampling rate differs), and I would also like the network to train on the entire recording at once, so it can "see" the long-range cycles.
Hi again @ptrblck
I have been thinking about this answer and I’m confused about a few situations where this solution could potentially be problematic.
Consider a situation where a few samples have a weight of 0. That is, intuitively, those samples (or observations or entries) are meaningless.
loss = loss * sample_weight could result in a loss of 0 for those particular samples. Hence, during gradient computation, wouldn't the network technically be looking at a loss of 0 (perfect classification/regression) for samples we weren't confident about to begin with? Do you think backprop on a loss of 0 could cause the network to crash?
Furthermore, wouldn’t sample_weight.requires_grad be true? Do you think it could create complications during backprop?
I’ve been doing something similar and my network almost always crashes mid training.
@Rakshit_Kothari, I cannot speak to if this -could- cause a crash, or if sample_weight -should- require_grad. However, I am confident, as I trained several hundred networks using this method, that it worked for me.
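For what it's worth, a quick sanity check (with a dummy model and made-up weights) suggests a weight of 0 simply zeroes that sample's gradient contribution, and a weight tensor built from plain data has requires_grad=False by default:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss(reduction='none')

x = torch.randn(4, 10)
target = torch.tensor([0, 1, 0, 1])
sample_weight = torch.tensor([1.0, 0.0, 1.0, 0.0])  # samples 1 and 3 are "junk"
print(sample_weight.requires_grad)  # False: plain data tensors don't track grads

out = model(x)
out.retain_grad()  # keep per-sample grads for inspection
loss = (criterion(out, target) * sample_weight).mean()
loss.backward()     # no crash; a zero-weighted loss just yields a zero gradient
print(out.grad[1])  # all zeros: the zero-weight sample propagates nothing
```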
Thanks. I used the above code and got an error in backward():
RuntimeError Traceback (most recent call last)
<ipython-input-22-88aafe38e0f1> in <module>()
15 loss =(loss * sample_weight / sample_weight.sum()).sum()
16 print (sample_weight.shape, loss.shape)
---> 17 loss.mean().backward()
19 #loss_total = torch.mean(loss * weights)
/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
92 tensors, grad_tensors, retain_graph, create_graph,
---> 93 allow_unreachable=True) # allow_unreachable flag
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
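For reference, that RuntimeError usually means the loss isn't connected to anything that requires gradients - e.g. the output tensor was created outside the model or was detached from the graph. A hypothetical minimal reproduction and fix:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss(reduction='none')
x = torch.randn(4, 10)
target = torch.randint(0, 2, (4,))
sample_weight = torch.rand(4)

# Fails: this "output" is a plain tensor with no grad_fn
bad_output = torch.randn(4, 2)
loss = (criterion(bad_output, target) * sample_weight).mean()
try:
    loss.backward()
except RuntimeError as e:
    print(e)  # element 0 of tensors does not require grad ...

# Works: the output comes from the model, so the loss has a grad_fn
output = model(x)
loss = (criterion(output, target) * sample_weight).mean()
loss.backward()
```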