Predict Multiple Binary Features

Hey

I would like to manually add an output layer to BERT in order to predict multiple binary features.

For example, these outputs would answer the questions:

• Is the text positive? 1 if yes, 0 otherwise.
• Is this text about sports? 1 if yes, 0 if not.
• Is this text about business? 1 if yes, 0 if not.

My first idea was to do it as a regression task, adding an output layer with 3 neurons: one for each question.

The pseudo-code:

```python
import torch.nn as nn
from transformers import CamembertModel

class MultiLabelBert(nn.Module):
    def __init__(self, dim_in=768):
        super().__init__()
        self.bert = CamembertModel.from_pretrained('camembert-base')
        self.regressor = nn.Sequential(nn.Linear(dim_in, 3))

    def forward(self, input_ids):
        outputs = self.bert(input_ids)            # "Bert Layers"
        pooled = outputs.last_hidden_state[:, 0]  # [CLS] token representation
        return self.regressor(pooled)             # linear head with 3 outputs
```

But I'll get values above 1 and below 0. Which kind of layer could I add in order to get probabilities (values between 0 and 1)?

I hope my question was clear, thank you for your help.

```python
import torch
import torch.nn as nn

m = nn.Softmax(dim=1)
input = torch.randn(4, 3)
output = m(input)

# Output:
# tensor([[0.7552, 0.0566, 0.1882],   --> each row adds up to 1
#         [0.0267, 0.6366, 0.3367],
#         [0.0937, 0.8218, 0.0845],
#         [0.1545, 0.6051, 0.2404]])
```

Softmax is appropriate only if the probabilities should sum to 1. In your example they don't have to - a text could be both about business and positive. You can try a `nn.Sigmoid()` layer instead, whose output always lies in the interval (0, 1) for each class independently.
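For comparison, here is a small sketch (the input values are illustrative) showing that sigmoid squashes each entry into (0, 1) independently, so the per-class scores do not have to sum to 1 the way softmax rows do:

```python
import torch
import torch.nn as nn

m = nn.Sigmoid()
logits = torch.tensor([[2.0, -1.0, 0.0],
                       [-3.0, 3.0, 1.5]])
probs = m(logits)  # each entry mapped independently into (0, 1)

# Unlike softmax, a row can sum to more (or less) than 1,
# so several classes can be "active" for the same sample.
```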


You are right. Thanks for the correction

Hi Te!

As you describe it, you are performing a multi-label, multi-class
classification. It is multi-class because you have three classes:
"positive," "sports," and "business." It is multi-label because for
any given sample, none, some, or all of the classes can be active,
so each sample can be labelled with multiple classes at the same
time.

As you have recognized, this is, in essence, three binary classification
problems that are run through your network at the same time.

The most typical approach for such a problem is to have your final
layer be a `Linear` with `out_features = 3` (your number of classes)
and to use `BCEWithLogitsLoss` as the loss criterion.
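A minimal sketch of this setup, using a stand-in `Linear` head in place of the full BERT encoder (the hidden size, data, and targets here are assumptions for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dim_in, num_classes = 768, 3            # assumed hidden size / three binary questions
head = nn.Linear(dim_in, num_classes)   # final layer: one logit per class
criterion = nn.BCEWithLogitsLoss()      # applies sigmoid internally

features = torch.randn(4, dim_in)       # stand-in for the BERT output
targets = torch.tensor([[1., 0., 1.],   # multi-label targets: each sample
                        [0., 0., 0.],   # can activate none, some, or all
                        [1., 1., 0.],   # of the three classes
                        [0., 1., 1.]])

logits = head(features)                 # raw scores, shape (4, 3)
loss = criterion(logits, targets)       # no sigmoid before the loss
loss.backward()
```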

The output of your network (which is the output of your final `Linear`
layer) should be interpreted as raw-score logits, one (per sample)
for each of your three classes.

If you need to convert the logits to probabilities (and you usually
don't), you do so by passing them through a `sigmoid()` function.

Note, however, you should pass the logits directly to `BCEWithLogitsLoss`,
without converting them to probabilities. `BCEWithLogitsLoss` expects
logits and has, in effect, a `sigmoid()` built into it. (For reasons of
numerical stability, you should not use `BCELoss`, which does expect
probabilities.)
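You can verify that `BCEWithLogitsLoss` has the `sigmoid()` built in by comparing it against `BCELoss` applied to sigmoided logits (the values below are illustrative):

```python
import torch
import torch.nn as nn

logits = torch.tensor([[1.2, -0.7, 0.3]])
targets = torch.tensor([[1., 0., 1.]])

loss_a = nn.BCEWithLogitsLoss()(logits, targets)
loss_b = nn.BCELoss()(torch.sigmoid(logits), targets)
# the two losses agree up to floating-point error, but the logits
# version uses the log-sum-exp trick and stays stable for extreme logits
```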

If you do need to convert your logits to probabilities, for whatever reason,
you should do it after computing your loss criterion, and not have it be
part of your model itself.