Predict multiple binary features

Hey :slight_smile:

I would like to manually add an output layer to BERT in order to predict multiple binary features.

For example, these outputs would answer the questions:

  • Is the text positive? 1 if yes, 0 otherwise.
  • Is this text about sports? 1 if yes, 0 if not.
  • Is this text about business? 1 if yes, 0 if not.

My first idea was to do it as a regression task, adding an output layer with 3 neurons: one for each question.

The pseudo-code:

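import torch.nn as nn
from transformers import CamembertModel

class MultiLabelCamembert(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = CamembertModel.from_pretrained('camembert-base')
        dim_in = self.bert.config.hidden_size  # width of the encoder output
        self.regressor = nn.Sequential(nn.Linear(dim_in, 3))

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids, attention_mask=attention_mask)  # "Bert Layers"
        pooled = outputs.last_hidden_state[:, 0]  # hidden state of the first token
        return self.regressor(pooled)  # Linear with an output of 3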

But I’ll get values above 1 and below 0. So, which kind of layer could I add in order to get probabilities (values between 0 and 1)?

I hope my question was clear, thank you for your help.

You can add a softmax layer after your regressor.

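import torch
import torch.nn as nn

m = nn.Softmax(dim=1)  # normalize across the 3 outputs of each sample
input = torch.randn(4, 3)
output = m(input)

# Example output (values vary because the input is random):
# tensor([[0.7552, 0.0566, 0.1882],   --> each row adds up to 1
#         [0.0267, 0.6366, 0.3367],
#         [0.0937, 0.8218, 0.0845],
#         [0.1545, 0.6051, 0.2404]])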

Softmax is a good fit when the probabilities must sum to 1, that is, when exactly one class is true. In your example they don’t have to: a text could be both about business and positive. You can try an nn.Sigmoid() layer instead, which maps each output independently into the interval [0,1].
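
To illustrate the difference from softmax, here is a minimal sketch that applies nn.Sigmoid() to random scores. The tensor shape (4 samples, 3 questions) and the seed are assumptions chosen only for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)          # assumed seed, only to make the run repeatable

sigmoid = nn.Sigmoid()
logits = torch.randn(4, 3)    # 4 samples, 3 independent yes/no questions
probs = sigmoid(logits)       # each entry is squashed into (0, 1) on its own

print(probs)
print(probs.sum(dim=1))       # row sums are not forced to equal 1
```

Each column can then be read as the answer to one question, independently of the other two.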


You are right. Thanks for the correction :smile:

Hi Te!

As you describe it, you are performing a multi-label, multi-class
classification. It is multi-class because you have three classes:
“positive,” “sports,” and “business.” It is multi-label because for
any given sample, none, some, or all of the classes can be active,
so each sample can be labelled with multiple classes at the same time.

As you have recognized, this is, in essence, three binary classification
problems that are run through your network at the same time.

The most typical approach for such a problem is to have your final
layer be a Linear with out_features = 3 (your number of classes)
and to use BCEWithLogitsLoss as the loss criterion.

The output of your network (which is the output of your final Linear
layer) should be interpreted as raw-score logits, one (per sample)
for each of your three classes.
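
As a rough sketch of this setup, the snippet below uses a Linear head with out_features = 3 together with BCEWithLogitsLoss. The random `features` tensor stands in for the encoder output, and `hidden_size = 768`, the batch of 4, and the target values are assumptions made up for the illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_size = 768                     # assumed size of the encoder output
head = nn.Linear(hidden_size, 3)      # one logit per class: positive, sports, business
criterion = nn.BCEWithLogitsLoss()    # has the sigmoid built in

features = torch.randn(4, hidden_size)     # stand-in for pooled BERT features of 4 texts
targets = torch.tensor([[1., 0., 1.],
                        [0., 1., 0.],
                        [1., 1., 0.],
                        [0., 0., 0.]])     # float 0/1 labels, same shape as the logits

logits = head(features)               # raw scores, may be negative or greater than 1
loss = criterion(logits, targets)     # logits go in directly, no sigmoid first
loss.backward()                       # gradients reach the head's parameters

print(logits.shape, loss.item())
```

Note that the targets are floats with the same shape as the logits, so none, some, or all classes can be active for a sample.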

If you need to convert the logits to probabilities (and you usually
don’t), you do so by passing them through a sigmoid() function.

Note, however, you should pass the logits directly to BCEWithLogitsLoss,
without converting them to probabilities. BCEWithLogitsLoss expects
logits and has, in effect, a sigmoid() built into it. (For reasons of
numerical stability, you should not use BCELoss, which does expect probabilities.)
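
A small sketch of the stability point, with made-up logits: for moderate values the two routes agree, but for an extreme logit the explicit sigmoid underflows in float32 and BCELoss no longer sees the true value.

```python
import torch
import torch.nn as nn

with_logits = nn.BCEWithLogitsLoss()   # takes raw logits
on_probs = nn.BCELoss()                # takes probabilities

# Moderate logits: both routes give the same loss.
logits = torch.tensor([[0.3, -1.2, 2.0]])
targets = torch.tensor([[1.0, 0.0, 1.0]])
loss_a = with_logits(logits, targets)
loss_b = on_probs(torch.sigmoid(logits), targets)

# Extreme logit: sigmoid(-200) underflows to 0 in float32,
# so BCELoss has to work from a clamped logarithm.
extreme_logit = torch.tensor([[-200.0]])
extreme_target = torch.tensor([[1.0]])
stable = with_logits(extreme_logit, extreme_target)
naive = on_probs(torch.sigmoid(extreme_logit), extreme_target)

print(loss_a.item(), loss_b.item())
print(stable.item(), naive.item())
```

In the second case the loss computed from logits keeps growing with the size of the error, while the loss computed from probabilities is capped.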

If you do need to convert your logits to probabilities, for whatever reason,
you should do it after computing your loss criterion, and not have it be
part of your backpropagation.
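
One way this conversion might look at prediction time is sketched below. The logits are hypothetical model outputs, and the 0.5 threshold is an assumed choice, not something prescribed in this thread.

```python
import torch

# Hypothetical logits for 2 texts and 3 classes (positive, sports, business).
logits = torch.tensor([[2.0, -1.0, 0.5],
                       [-3.0, 1.5, 4.0]])

with torch.no_grad():                   # keep the conversion out of autograd
    probs = torch.sigmoid(logits)       # per-class probabilities
    preds = (probs > 0.5).long()        # assumed threshold of 0.5

print(probs)
print(preds)
```

Thresholding the probabilities at 0.5 gives the same decisions as checking whether each logit is greater than 0, which is one reason the probabilities are often not needed.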


K. Frank