ROC, AUROC, TP, TN, FP, FN, Specificity, Sensitivity are performance metrics typically defined for binary classification. Consequently, these are all frequently used in medicine to measure the performance of binary medical tests/exams on the presence/absence of one pathologic sign or disease.
From the information you gave, your problem looks like a multi-label, multi-class problem. Each study can probably have 0, 1 or more positive diseases, so the target is a multi-hot encoding rather than the one-hot encoding of a single-label, multi-class problem. Consequently, you can’t directly extract binary performance metrics from it. Training a model with a final softmax activation function and a categorical cross-entropy loss can produce confusing metrics for this kind of problem, because softmax forces the disease probabilities to sum to 1 and so makes the diseases compete with each other. Using a sigmoid activation function with a binary cross-entropy loss should give a more independent result for each disease. In your binary study|Truth|Predict format, a Truth value of 1 could be interpreted as the presence of any disease (a non-normal case) in the study, but your model could wrongly predict a different disease and still output a Predict value of 1, so the result would look good even though it is not.
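To make the softmax versus sigmoid point concrete, here is a minimal sketch using NumPy only. The logits and the 3-disease setup are made-up values for illustration, not output from your model. It shows one study where two diseases are present at the same time: softmax splits the probability mass between them, while sigmoid scores each disease on its own.

```python
import numpy as np


def softmax(z):
    # Subtract the max for numerical stability; outputs sum to 1.
    e = np.exp(z - z.max())
    return e / e.sum()


def sigmoid(z):
    # Each output is an independent probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical logits for one study with 3 diseases (assumption for illustration).
logits = np.array([3.0, 3.0, -2.0])
truth = np.array([1, 1, 0])  # multi-hot: diseases 0 and 1 are both present

softmax_probs = softmax(logits)
sigmoid_probs = sigmoid(logits)

# Apply the usual 0.5 cutoff to both sets of probabilities.
softmax_pred = (softmax_probs >= 0.5).astype(int)
sigmoid_pred = (sigmoid_probs >= 0.5).astype(int)

print("softmax probs:", softmax_probs.round(3), "pred:", softmax_pred)
print("sigmoid probs:", sigmoid_probs.round(3), "pred:", sigmoid_pred)
```

With softmax, the two equally strong diseases each end up just under 0.5, so neither passes the cutoff. With sigmoid, both are detected independently.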
That said, here are some hints for extracting binary performance metrics from your 14-disease multi-label, multi-class problem:
- Create 14 different binary problems, one for each disease, and evaluate your binary metrics for each specific disease (disease vs no disease)
- TP, TN, FP, FN for a specific disease require a cutoff value on the predicted probability, usually 0.5 if the model was trained with a sigmoid activation function.
- ROC and AUROC can also be computed for each disease (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html)
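- If you want to give a very simple representation of the overall performance of your model, you could average all the per-disease AUROC results into one value (a macro-average). That isn’t exactly valid statistically, since it hides the differences between diseases and ignores how frequent each disease is, so treat it as a rough summary only.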
- You could also plot confusion matrices, for example one 2x2 matrix per disease, which is often a better representation for a multi-label, multi-class problem than a single averaged number.
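The hints above can be sketched with sklearn.metrics as follows. This is only a sketch under assumptions: `y_true` and `y_prob` are toy arrays with 6 studies and 3 diseases (not your 14), and `per_disease_metrics` is a hypothetical helper name, not a scikit-learn function. It assumes sigmoid outputs and a 0.5 cutoff.

```python
import numpy as np
from sklearn.metrics import (
    multilabel_confusion_matrix,
    roc_auc_score,
    roc_curve,
)

# Toy data (assumption): rows are studies, columns are diseases.
# y_true is multi-hot; y_prob holds per-disease sigmoid probabilities.
y_true = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
    [1, 0, 1],
])
y_prob = np.array([
    [0.9, 0.20, 0.1],
    [0.3, 0.80, 0.2],
    [0.7, 0.40, 0.1],
    [0.1, 0.30, 0.6],
    [0.6, 0.45, 0.2],
    [0.8, 0.20, 0.9],
])


def per_disease_metrics(y_true, y_prob, threshold=0.5):
    """Treat each disease column as its own binary problem."""
    y_pred = (y_prob >= threshold).astype(int)
    # One 2x2 matrix per disease, laid out as [[TN, FP], [FN, TP]].
    matrices = multilabel_confusion_matrix(y_true, y_pred)
    results = []
    for j, matrix in enumerate(matrices):
        tn, fp, fn, tp = (int(v) for v in matrix.ravel())
        # ROC/AUROC use the raw probabilities, not the thresholded labels.
        # Note: roc_auc_score raises an error if a disease column contains
        # only one class (all 0 or all 1).
        fpr, tpr, _ = roc_curve(y_true[:, j], y_prob[:, j])
        results.append({
            "disease": j,
            "tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
            "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
            "auroc": float(roc_auc_score(y_true[:, j], y_prob[:, j])),
            "roc_points": (fpr, tpr),  # ready to plot, e.g. with matplotlib
        })
    return results


results = per_disease_metrics(y_true, y_prob)
# Rough single-number summary: unweighted mean of the per-disease AUROCs.
macro_auroc = float(np.mean([r["auroc"] for r in results]))

for r in results:
    print(
        f"disease {r['disease']}: TP={r['tp']} TN={r['tn']} "
        f"FP={r['fp']} FN={r['fn']} "
        f"sens={r['sensitivity']:.2f} spec={r['specificity']:.2f} "
        f"AUROC={r['auroc']:.3f}"
    )
print(f"macro-average AUROC: {macro_auroc:.3f}")
```

For your data, you would replace the toy arrays with your own ground-truth matrix and model outputs (shape: number of studies by 14). The cutoff of 0.5 is only a starting point; it can be tuned per disease from the ROC curve.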
sklearn.metrics should help with the implementation. Let me know if you run into a problem or have any other questions.