On page 3 of this paper: https://arxiv.org/pdf/1511.02251.pdf there is a multi-class logistic loss:

With the accompanying text:

*The multi-class logistic loss minimizes the negative sum of the log-probabilities over all positive labels. Herein, the probabilities are computed using a softmax layer.*

We know that multi-class means “a sample can be one of N classes where N>2”.

The loss takes a sum of positive labels (`y_nk`

is `0`

if the target is negative and `1`

if the target is true), so why is this loss called a **multi-class** loss? The way I see it, this should be called a **multi-label** loss, since it will learn from multiple true labels (if an input has multiple true labels)?