Hello. Thanks for reading. I have a dataset with different chemical assays. Each assay can be positive = 1, negative = 0, or not measured = -1. I have now implemented an FNN model with the ReLU activation function. I first tried MSE loss and then cross-entropy loss. With MSE my loss gets very low very quickly, but the predictions are nonsense: no value is near 0.9 or even 1, even though some of the assays are certainly positive. With cross-entropy I now get a negative loss. I thought it was time to ask for help. All input and tips would be helpful to me.
Based on your description it seems you are working on a multi-class classification use case with 3 classes. In this case nn.CrossEntropyLoss would be the right loss function.
This loss function expects a model output in the shape [batch_size, nb_classes=3] and a target tensor in the shape [batch_size] containing class indices in the range [0, nb_classes-1], so [0, 1, 2] in your case.
You would have to map your current labels from [-1, 0, 1] to [0, 1, 2] which might also explain the negative loss you are seeing.
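As a minimal sketch of that label mapping (the label values and the random logits below are made up for illustration, not taken from the original dataset):

```python
import torch
import torch.nn as nn

# Hypothetical labels for one assay: -1 = not measured, 0 = negative, 1 = positive
raw_labels = torch.tensor([-1, 0, 1, 1, 0])

# Shift to class indices 0..2, which is what nn.CrossEntropyLoss expects
targets = raw_labels + 1

torch.manual_seed(0)
logits = torch.randn(5, 3)  # stand-in for the model output, shape [batch_size, nb_classes]

criterion = nn.CrossEntropyLoss()
loss = criterion(logits, targets)  # non-negative for valid class indices
```

With valid class indices the cross-entropy loss cannot be negative, so a negative value is a hint that the targets are not in the expected format.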
Thank you for your answers. I tried to use a mask now to mask out all “not measured” classes. I have 11 classes where they can be either 0 or 1 now. (Sorry for not explaining very well)
I use 3 Linear layers with ReLU in between, and I am wondering whether I should switch them all for Softmax or only the last one. Before the last Linear or after? I am not sure how to make use of pos_weight in CrossEntropyLoss, since I have 11 classes and don't know which of them is 0 or 1. For now I apply a Boolean mask before calculating the loss.
My new problem is that the loss only fluctuates instead of decreasing. It stays around 1200, which I have never seen before. I can't solve my problem. Maybe I should just stop.
Thank you anyways for trying to help me. I am still open to inputs if someone has time for me.
             assay 1   assay 2   assay 3
compound1:      0        -1         1
ReLU as an activation on intermediate layers serves a different purpose: it gives your model the ability to be decisive. Think of the model as a game of 20 questions: "Is it red?" An activation layer such as ReLU zeroes out the negative side of a red/green spectrum, which pushes the signal toward a clearer yes or no. You want that kind of objectivity in a model.
This should not be the case, since nn.CrossEntropyLoss expects raw logits. Internally it applies F.log_softmax followed by nn.NLLLoss, so adding another F.softmax would compute loss = F.nll_loss(F.log_softmax(F.softmax(output, 1), 1), target) and would thus decrease the magnitude of the gradients.
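A small sketch can make the effect visible. The logits here are invented for illustration. It compares the gradient norm with and without the extra softmax for a confidently wrong prediction, and also shows that the loss cannot reach zero once the inputs are squashed into [0, 1]:

```python
import torch
import torch.nn.functional as F

target = torch.tensor([0])

# A confidently wrong prediction: class 1 is favoured, class 0 is correct
logits_a = torch.tensor([[0.0, 5.0, 0.0]], requires_grad=True)
logits_b = logits_a.detach().clone().requires_grad_(True)

loss_raw = F.cross_entropy(logits_a, target)                        # raw logits
loss_double = F.cross_entropy(F.softmax(logits_b, dim=1), target)   # extra softmax

loss_raw.backward()
loss_double.backward()
grad_raw = logits_a.grad.norm()
grad_double = logits_b.grad.norm()  # much smaller than grad_raw

# Even a perfect prediction cannot push the double-softmax loss to zero:
# the best case is log(1 + 2/e), roughly 0.55, for 3 classes
perfect = torch.tensor([[100.0, 0.0, 0.0]])
floor = F.cross_entropy(F.softmax(perfect, dim=1), target)
```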
nn.CrossEntropyLoss does not provide the pos_weight argument, but nn.BCEWithLogitsLoss does, which is used for a multi-label or binary classification.
This sounds indeed like a multi-label classification use case where each sample can belong to zero, one, or multiple classes. If so, you should use nn.BCEWithLogitsLoss instead and pass the raw logits to the loss function.
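One way to combine this with the "not measured" entries is to compute the loss per element and average only over the measured ones. The following is a sketch under assumptions: the tensors are invented, and -1 marks "not measured" as in the original question.

```python
import torch
import torch.nn as nn

# Hypothetical batch: 2 compounds x 3 assays; -1 = not measured
labels = torch.tensor([[0.0, -1.0, 1.0],
                       [1.0, 0.0, -1.0]])

torch.manual_seed(0)
logits = torch.randn(2, 3)  # raw model output, no sigmoid applied

mask = labels != -1                    # True where the assay was measured
safe_targets = labels.clamp(min=0.0)   # replace -1 with a dummy 0; masked out below

criterion = nn.BCEWithLogitsLoss(reduction="none")
per_element = criterion(logits, safe_targets)

# Average over measured entries only; clamp guards against an all-masked batch
loss = (per_element * mask).sum() / mask.sum().clamp(min=1)
```

Dividing by the number of measured entries rather than summing keeps the loss on a comparable scale from batch to batch.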
The weight should be based on your targets/labels. For example, suppose class 0 has only half as many training samples as the other classes: you'd set its value to 2 in the weight tensor, while the others would be 1. The weight is just a tensor of multipliers that balances the classes. It should be the same length as your number of classes. It's only necessary if your training samples have a significant class imbalance.
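A sketch of how such weights could be derived from label counts follows. The counts and labels are hypothetical. The second half applies the same idea to the multi-label case via the pos_weight argument of nn.BCEWithLogitsLoss, which is an extension of the example above rather than something spelled out in it.

```python
import torch
import torch.nn as nn

# Hypothetical class counts: class 0 has half as many samples as the others
class_counts = torch.tensor([50.0, 100.0, 100.0])
weight = class_counts.max() / class_counts  # class 0 gets 2, the others get 1
criterion = nn.CrossEntropyLoss(weight=weight)

# Multi-label analogue: per-assay pos_weight = negatives / positives,
# counted over measured entries only (-1 = not measured)
labels = torch.tensor([[0.0, -1.0, 1.0],
                       [0.0, 1.0, -1.0],
                       [1.0, 0.0, 1.0],
                       [0.0, 0.0, 0.0]])
n_pos = (labels == 1).sum(dim=0).float()
n_neg = (labels == 0).sum(dim=0).float()
pos_weight = n_neg / n_pos.clamp(min=1)  # clamp avoids division by zero
multi_label_criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
```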
One more possible strategy you can try is initializing the biases in your layers to 1s. For example, you could add something like this in your init after making the layers:
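The snippet from the original post is not reproduced here. As a hedged sketch of the idea, assuming a model with three Linear layers and 11 outputs (the class name AssayNet and the layer sizes are hypothetical):

```python
import torch
import torch.nn as nn

class AssayNet(nn.Module):  # hypothetical model name and sizes
    def __init__(self, in_features=16, hidden=32, n_assays=11):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_assays),  # raw logits, no final activation
        )
        # After creating the layers, set every Linear bias to 1
        for module in self.modules():
            if isinstance(module, nn.Linear):
                nn.init.ones_(module.bias)

    def forward(self, x):
        return self.layers(x)

model = AssayNet()
```

Whether this helps is an empirical question for the dataset at hand; it is cheap to try next to the default initialization.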
Correct. But in another thread, adding a softmax activation on the outputs before CrossEntropyLoss gave consistently better inference with randomly initialized weights, whereas without it the model would sometimes overfit to the targets and sometimes not. I have admittedly not dug into the reasons why that was the case, but it could be worth trying if other suggestions aren't working. Here is that thread (with a reproducible code snippet):
After retesting the code snippet on that other thread, it seems softmax doesn’t work whatsoever. It must have been sigmoid that I used with good results in that example. I am currently getting decent results based on using sigmoid for that example. I’ve updated that thread accordingly.
On further testing, setting the activation to sigmoid also was hit and miss, although it seemed slightly better than no final activation for that example. Setting the biases to 1s had very good results every time. I will remove the comments about adding a sigmoid activation function.