Multilabel classification: How to binarize scores? (How to learn thresholds?)

Hi PyTorchers,

I’ve been using PyTorch for smaller tasks for a while and now want to do multilabel classification for the first time. My task is to assign an arbitrary subset of 11 possible labels/classes to each sentence. So my output should be a vector with 11 binary entries (0 = class not detected, 1 = class detected).

In order to do so, I have an LSTM that takes the sentence word by word (encoded by word2vec) and feeds its last output into a linear layer, whose output the model returns. So the output of my model is a vector of 11 float values. I am not applying softmax, sigmoid, or anything else.

My model looks like this:

import numpy as np
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence
from gensim.models import KeyedVectors

class LSTM(nn.Module):
    def __init__(self, hidden_dim, tagset_size):
        super(LSTM, self).__init__()
        self.hidden_dim = hidden_dim
        self.layers = 1
        self.dropout = 0.0

        word2vec = KeyedVectors.load('word2vec.vocab', mmap='r')
        self.word2idx = lambda word: word2vec.vocab[word].index if word in word2vec.vocab else 0 
        self.sent2idx = lambda sent: [self.word2idx(word) for word in sent.split(' ')]
        embedding_weights = torch.FloatTensor(np.array(word2vec.wv.syn0))
        num_embeddings, embedding_dim = embedding_weights.shape
        self.w2v_emb = nn.Embedding.from_pretrained(embedding_weights, freeze=True)

        self.bidir = True
        self.dirs = 1 + int(self.bidir)
        self.clear_hidden(n=chunk_size)

        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=self.layers, bidirectional=self.bidir, batch_first=True, dropout=self.dropout)
        self.hidden2tag = nn.Linear(self.dirs * hidden_dim, tagset_size)

    def clear_hidden(self, n):
        self.hidden = (torch.zeros(self.dirs * self.layers, n, self.hidden_dim).to(device),
                       torch.zeros(self.dirs * self.layers, n, self.hidden_dim).to(device))

    def forward(self, sentence):
        idxs_unpadded = [torch.tensor(x, dtype=torch.long) for x in list(map(self.sent2idx, sentence))]
        lengths = [len(x) for x in idxs_unpadded]
        idxs_padded = pad_sequence(idxs_unpadded, batch_first=True, padding_value=0)

        idxs = idxs_padded.to(device)  # already a LongTensor from pad_sequence, just move it
        embeddings = self.w2v_emb(idxs).cpu()
        embeddings, lengths, perm = prepare_for_lstm(embeddings, lengths)
        embeddings = embeddings.to(device)
        lstm_out, self.hidden = self.lstm(embeddings, self.hidden)

        results, lengths = unpack_lstm_output(lstm_out)

        last_outputs = torch.stack([result[length.item() - 1] for result, length in zip(results, lengths)])  # last valid timestep of each sequence
        tag_space = self.hidden2tag(last_outputs)

        return tag_space, perm

My questions now are:

  1. Which loss function should I use? I read different opinions on the web: BCEWithLogitsLoss, MultiLabelMarginLoss, CrossEntropyLoss, etc.
  2. How do I convert the float output to a binary output? I need to find some thresholds, right? Is it possible to find these via end-to-end learning, i.e. can the model learn the thresholds itself and directly output 11 true/false values?
  3. Is there a full working example of multilabel classification INCLUDING how to binarize the outputs and how to evaluate the model (accuracy, Jaccard score, etc.)?
  4. If you see anything else that I am doing wrong, please let me know. I am still quite new to PyTorch.

Would be very nice if someone could help me a bit :)

Best,
Simon

You could use BCEWithLogitsLoss. I’ve created a small dummy example in another thread.

Your model will output logits, which you can feed into a sigmoid layer.
The choice of the threshold depends on your use case, e.g. some classes should have a high sensitivity while others a high specificity.
Have a look at the scikit-learn explanation of multi-class ROCs.
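
Roughly, that setup could look like this (just a minimal sketch; the shapes, the dummy data and the 0.5 cut-off are placeholders):

import torch
import torch.nn as nn

batch_size, num_classes = 4, 11                                    # 11 labels as in the question
logits = torch.randn(batch_size, num_classes)                      # raw model output, no activation
targets = torch.randint(0, 2, (batch_size, num_classes)).float()   # multi-hot ground truth

criterion = nn.BCEWithLogitsLoss()                                 # applies the sigmoid internally
loss = criterion(logits, targets)

# inference: sigmoid -> per-class probabilities -> binarize with a threshold
probs = torch.sigmoid(logits)
preds = (probs > 0.5).long()                                       # 0.5 is only a starting point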

Thank you! As you suggested, I am now using BCEWithLogitsLoss, and the output of my model is a linear layer without an activation function after it. It seems to work quite well.

But I have a general question. I tried two approaches:

  1. Train only on train set and tune thresholds on the validation set
  2. Train on train + validation set and tune thresholds on these sets as well

Is approach 2 prone to overfitting? Are there any reasons not to do this? I get slightly better results when I do it like this. Of course, I also have a test split which I never see during training or threshold tuning. I evaluated both approaches on the test set and get better results with #2. This is most likely due to the larger number of training samples.
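
To make the threshold-tuning part concrete, a per-class grid search on the validation set could look like this (just a sketch; val_probs are the sigmoid outputs and val_targets the multi-hot labels, both as numpy arrays of shape (num_samples, 11)):

import numpy as np
from sklearn.metrics import f1_score

def tune_thresholds(val_probs, val_targets, candidates=np.linspace(0.05, 0.95, 19)):
    """Pick, per class, the threshold that maximizes F1 on the validation set."""
    num_classes = val_probs.shape[1]
    thresholds = np.full(num_classes, 0.5)
    for c in range(num_classes):
        scores = [f1_score(val_targets[:, c], val_probs[:, c] > t) for t in candidates]
        thresholds[c] = candidates[int(np.argmax(scores))]
    return thresholds

# test_preds = (test_probs > thresholds).astype(int)  # thresholds broadcast per class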

Does someone maybe know more about the last question regarding the splits?

Your second approach introduces a data leak and should not be used.
The validation set is used as a proxy for your test data, so using it to train your model is not a good idea.
I would recommend sticking to the first approach.

Hi,
okay, thank you. But why would it introduce a data leak? I mean I never see the test data during training or hyperparameter tuning.

Anyway, I have tried training with just the train data and it works fine (probably because the validation set is only 10% of the training set, so it does not make a big difference). However, what surprises me is that if I do not learn thresholds but just use 0.5 as the threshold for every label, I get better test results than taking the midpoint between the average scores of the positive and negative samples as a threshold. How can that be possible? Is sigmoid intended to be used with a threshold of 0.5, so that there is no need to fine-tune the thresholds? I find it hard to find any resources on how to binarize sigmoid outputs, so I am just trying and guessing right now.
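
For clarity, the midpoint heuristic I mean could look roughly like this (again just a sketch with the same placeholder names):

import numpy as np

def midpoint_thresholds(val_probs, val_targets):
    """Per class: midpoint between the mean sigmoid score of positive and negative samples."""
    thresholds = []
    for c in range(val_probs.shape[1]):
        pos = val_probs[val_targets[:, c] == 1, c]
        neg = val_probs[val_targets[:, c] == 0, c]
        thresholds.append(0.5 * (pos.mean() + neg.mean()) if len(pos) and len(neg) else 0.5)
    return np.array(thresholds)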

You are right, “data leak” might be the wrong term. Instead you’ll get biased estimates of your test accuracy, since the validation data has already been touched.

What is your metric to see if your results are better?
Usually you would check the ROC curve at various thresholds to see where to cut off.
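
As a sketch of what that could look like, scikit-learn’s roc_curve gives you the full sweep over thresholds, and picking the point that maximizes TPR minus FPR (Youden’s J) is one common rule of thumb (numpy arrays of validation probabilities and targets assumed):

from sklearn.metrics import roc_curve

def roc_thresholds(val_probs, val_targets):
    """Per class, pick the ROC operating point that maximizes TPR - FPR."""
    thresholds = []
    for c in range(val_probs.shape[1]):
        fpr, tpr, thr = roc_curve(val_targets[:, c], val_probs[:, c])
        thresholds.append(thr[(tpr - fpr).argmax()])
    return thresholds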

There was the Amazon forest competition, scored with the F2 metric, which needed actual labels and not scores as output.

To get adaptive thresholds, I used the Basinhopping optimizer + L-BFGS on the MultiLabelMarginLoss outputs.
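
Roughly, that idea can be sketched with scipy’s basinhopping using L-BFGS-B as the local minimizer, searching over a vector of per-class thresholds that maximizes the F2 score on validation predictions (placeholder names, not the exact code):

import numpy as np
from scipy.optimize import basinhopping
from sklearn.metrics import fbeta_score

def optimize_thresholds(val_probs, val_targets, beta=2.0):
    """Global search over per-class thresholds maximizing the F-beta score."""
    def neg_fbeta(thresholds):
        preds = (val_probs > thresholds).astype(int)
        return -fbeta_score(val_targets, preds, beta=beta, average='samples')

    x0 = np.full(val_probs.shape[1], 0.5)
    result = basinhopping(neg_fbeta, x0,
                          minimizer_kwargs={'method': 'L-BFGS-B',
                                            'bounds': [(0.0, 1.0)] * len(x0)})
    return result.x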

I’ve explored various ways to avoid thresholding, including creating an RNN that would output labels, but note that there is no easy answer; many papers are being published on this. You can start with the following:

Note that sigmoid/MultiLabelMarginLoss assume the class probabilities are independent, which is usually not true.
