How should I interpret the output probabilities?

I have a dataset with a binary outcome: 0 if a disease is absent and 1 if it is present.

I have a CNN implemented, but it returns an AUROC for each of the 14 diseases. I want to extract the predicted value for each study, to output:

study | Truth | Predict
1     | 0     | 0.08

My question is how to interpret the probabilities: when should I say that the machine was correct or incorrect?

Here is an example of the output from one study:

[ 0.09663379  0.36627278  0.03541835  0.08720721  0.02466215  0.04100307
   0.02745389  0.05987659  0.07339925  0.13602285  0.06774765  0.00617875
   0.02206573  0.03071797]

The ground truth in this case would be:

0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.

meaning that this study is normal.

How can I extract TP | TN | FP | FN?

ROC, AUROC, TP, TN, FP, FN, specificity, and sensitivity are performance metrics typically defined for binary classification. Consequently, they are all frequently used in medicine to measure the performance of binary medical tests/exams on the presence or absence of a single pathologic sign or disease.

With the information you gave, your problem looks like a multi-label, multi-class problem: each study can have 0, 1, or more positive diseases, so the target is a multi-hot encoding rather than the one-hot encoding of a single-label, multi-class problem. Consequently, you can't directly extract binary performance metrics from it. Training a model with a final softmax activation and a categorical cross-entropy loss can produce confusing metrics for this kind of problem, because softmax forces the 14 outputs to compete with each other; a sigmoid activation with a binary cross-entropy loss keeps the result for each disease independent. In your binary study | Truth | Predict format, a Truth value of 1 could be interpreted as the presence of any disease (a not-normal case), but your model could then inappropriately flag a different disease with a Predict value of 1 and the result would still look good.
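To make that concrete, here is a minimal sketch of such an output head, assuming a TensorFlow/Keras model; the backbone layers and input shape are placeholders, not your actual architecture:

```python
import tensorflow as tf

NUM_DISEASES = 14

# Placeholder backbone; substitute your real CNN layers here (the input
# shape is an assumption made for illustration).
backbone = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(224, 224, 1)),
    tf.keras.layers.GlobalAveragePooling2D(),
])

model = tf.keras.Sequential([
    backbone,
    # Sigmoid yields 14 independent probabilities in [0, 1]; softmax would
    # force them to sum to 1, which is wrong when diseases can co-occur.
    tf.keras.layers.Dense(NUM_DISEASES, activation="sigmoid"),
])

# Binary cross-entropy scores each disease as its own binary problem.
model.compile(optimizer="adam", loss="binary_crossentropy")
```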

That said, here are some hints for extracting binary performance metrics from your 14-disease multi-label, multi-class problem:

  1. Create 14 different binary problems, one per disease, and evaluate your binary metrics for each specific disease (disease vs. no disease).
  2. TP, TN, FP, and FN for a specific disease need a cutoff value on the probability, usually 0.5 if the model was trained with a sigmoid activation function (see the first sketch after this list).
  3. ROC and AUROC can also be computed for each disease (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html).
  4. If you want a very simple summary of your model's multi-class performance, you could average all the AUROC results into one value, but that isn't exactly valid statistically (see the second sketch after this list).
  5. Of course, you could also plot a confusion matrix per disease, which is a better representation for a multi-class, multi-label problem.
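For hints 1-3, here is a minimal sketch using sklearn.metrics, assuming `y_true` and `y_prob` are `(n_studies, 14)` arrays holding the multi-hot ground truth and the sigmoid outputs; the random arrays below are just toy stand-ins for your data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 14))  # toy multi-hot ground truth
y_prob = rng.random(size=(100, 14))          # toy sigmoid outputs

threshold = 0.5                              # cutoff on the probabilities
y_pred = (y_prob >= threshold).astype(int)

for disease in range(y_true.shape[1]):
    # Hint 1: treat each disease column as its own binary problem.
    tn, fp, fn, tp = confusion_matrix(
        y_true[:, disease], y_pred[:, disease], labels=[0, 1]
    ).ravel()
    # Hint 3: AUROC is computed from the raw probabilities, not the cutoff.
    auroc = roc_auc_score(y_true[:, disease], y_prob[:, disease])
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    print(f"disease {disease:2d}: TP={tp:3d} TN={tn:3d} FP={fp:3d} FN={fn:3d} "
          f"sens={sensitivity:.2f} spec={specificity:.2f} AUROC={auroc:.2f}")
```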
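And for hints 4-5, a sketch of the macro-averaged AUROC and of sklearn's multilabel_confusion_matrix, which gives one 2x2 confusion matrix per disease (same toy arrays as above):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, multilabel_confusion_matrix

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 14))  # toy multi-hot ground truth
y_prob = rng.random(size=(100, 14))          # toy sigmoid outputs
y_pred = (y_prob >= 0.5).astype(int)

# Hint 4: average="macro" takes the unweighted mean of the 14
# per-disease AUROCs, giving one (statistically loose) summary number.
macro_auroc = roc_auc_score(y_true, y_prob, average="macro")
print(f"macro-averaged AUROC: {macro_auroc:.3f}")

# Hint 5: a (14, 2, 2) stack of confusion matrices, one per disease,
# each laid out as [[TN, FP], [FN, TP]].
mcm = multilabel_confusion_matrix(y_true, y_pred)
print(mcm[0])  # confusion matrix for the first disease
```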

Of course, sklearn.metrics should help with the implementation. Let me know if you run into a problem or have any other questions.


I have this same type of multi-class, multi-label problem. Do you have any code samples? My data is similar to the one in the question.