Best Practice: Segmentation evaluation metrics

Hi @all,

I have a more general question concerning evaluation metrics for segmentation. Which ones are commonly used, or which do you see as ‘best practice’ for researchers? Which ones would you include in publications?

I am talking about comparability rather than efficiency. I am aware that the choice of metrics can be task-dependent: some applications will work best with very specialized or custom methods. I am also aware that the selection of metrics can be predefined by circumstances (e.g. a Kaggle competition). But if I am new to (semantic) segmentation, which metrics should I implement in my code to evaluate the output of my segmentation model?

Here is a non-comprehensive list of metrics that I’ve found (and partially used) so far. I did not always include ‘statistical variations’ like mean/average, median, standard deviation, etc. For a deeper look into the topic, see: Taha, Abdel Aziz, and Allan Hanbury. “Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool.” BMC Medical Imaging 15.1 (2015): 29.

Metrics that seem to be commonly used (e.g. [1] [2])

Based on confusion matrix (a sketch follows the list)

  • Global/Per-Class Accuracy
  • Precision
  • Recall
  • Intersection over Union (IoU) = Jaccard Index
  • F1 score = Sørensen–Dice coefficient (Dice)
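
All five can be read off a single confusion matrix, so one pass over the predictions is enough. Here is a minimal NumPy sketch (the helper names are my own, not from any particular library):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Confusion matrix for integer label arrays (rows = truth, cols = prediction)."""
    idx = num_classes * y_true.astype(int) + y_pred.astype(int)
    return np.bincount(idx.ravel(), minlength=num_classes ** 2).reshape(num_classes, num_classes)

def per_class_metrics(cm, eps=1e-7):
    """Per-class precision, recall, IoU (Jaccard) and Dice (F1) from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp  # predicted as class c, but ground truth differs
    fn = cm.sum(axis=1) - tp  # ground truth is class c, but prediction differs
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)             # also the per-class accuracy
    iou = tp / (tp + fp + fn + eps)           # Jaccard index
    dice = 2 * tp / (2 * tp + fp + fn + eps)  # F1 score
    return precision, recall, iou, dice

# Toy example with 3 classes; in practice y_true/y_pred are (flattened) label masks.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
cm = confusion_matrix(y_true, y_pred, num_classes=3)
global_accuracy = np.trace(cm) / cm.sum()
precision, recall, iou, dice = per_class_metrics(cm)
```

Mean IoU is then just `iou.mean()`, optionally masking out classes that never occur in the ground truth.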

Metrics that I rarely encounter

Based on confusion matrix

  • Cohen’s kappa / Observed Accuracy (see the sketch after this list)
  • Global Consistency Error (GCE) / Local Consistency Error (LCE)
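
Cohen’s kappa corrects the observed accuracy p_o for the agreement p_e expected by chance from the class marginals: kappa = (p_o - p_e) / (1 - p_e). A minimal sketch (scikit-learn also ships this as `cohen_kappa_score`, which takes label arrays directly):

```python
import numpy as np

def cohens_kappa(cm):
    """Cohen's kappa from a confusion matrix: (p_o - p_e) / (1 - p_e)."""
    total = cm.sum()
    p_o = np.trace(cm) / total                                  # observed accuracy
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2  # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)

cm = np.array([[1, 1, 0],
               [0, 2, 0],
               [1, 0, 1]])  # e.g. the toy confusion matrix from the sketch above
print(cohens_kappa(cm))     # 0.5
```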

Based on pair counting

  • Rand Index / Adjusted Rand Index (see the sketch below)
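
Pair-counting metrics treat a segmentation as a clustering of pixels: for every pixel pair they check whether both maps agree on “same segment” vs. “different segments”. A minimal sketch with scikit-learn (`rand_score` needs scikit-learn >= 0.24; flatten 2D masks with `.ravel()` first):

```python
from sklearn.metrics import adjusted_rand_score, rand_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

print(rand_score(y_true, y_pred))           # fraction of pixel pairs both maps agree on
print(adjusted_rand_score(y_true, y_pred))  # chance-corrected; 1.0 = identical partitions
```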

Based on spatial distance

  • Hausdorff Distance and weighted Hausdorff Distance (see the sketch below)
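
SciPy’s `directed_hausdorff` works on point sets, so binary masks first have to be converted to foreground coordinates. A minimal sketch (many papers compute the distance only over boundary/surface points; for brevity this uses all foreground pixels and assumes both masks are non-empty):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff_distance(mask_a, mask_b):
    """Symmetric Hausdorff distance between two binary masks, in pixels."""
    pts_a = np.argwhere(mask_a)  # (N, 2) coordinates of foreground pixels
    pts_b = np.argwhere(mask_b)
    return max(directed_hausdorff(pts_a, pts_b)[0],
               directed_hausdorff(pts_b, pts_a)[0])

mask_a = np.zeros((8, 8), dtype=bool); mask_a[2:5, 2:5] = True
mask_b = np.zeros((8, 8), dtype=bool); mask_b[3:6, 3:6] = True
print(hausdorff_distance(mask_a, mask_b))  # ~1.41: masks offset by one diagonal pixel
```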

There are lots more metrics out there, but did I miss some major ones? Which of these do you suggest for starters? And which ones do you use?

As you said, it’s very task-dependent. In medical applications, I would say most papers use the Dice coefficient, the Jaccard index (IoU), and the Hausdorff distance. You need to check which metrics are most common in your field.
