I am working on a binary segmentation problem using Pytorch. I want to know the correct way of calculating metrics like precision, recall, f1-score, mIOU, etc for my test set. From many of the online codes available, I found different ways of calculation, which I have mentioned below:

##### Method 1

- Calculate metrics for individual images separately.
- Compute the sum of each metric of all test images.
- Divide the metrics by the number of test images.

Example:

```
for i, (x, y) in enumerate(zip(test_x, test_y)):
...
Score += calc_metrics(mask1, pred1)
# Score could be any of the metrics
Final_Score = Score/ len(test_x)
```

##### Method 2

- Add TP, FP, TN, and FN pixel count of all the images and prepare a confusion matrix.
- Calculate all metrics at the end using the total TP, FP, TN, and FN pixel count and confusion matrix.

Example:

```
for i, (x, y) in enumerate(zip(test_x, test_y)):
...
FP += np.float(np.sum((pred1==1) & (mask1==0)))
FN += np.float(np.sum((pred1==0) & (mask1==1)))
TP += np.float(np.sum((pred1==1) & (mask1==1)))
TN += np.float(np.sum((pred1==0) & (mask1==0)))
Score = calc_metrics(TP, FP, TN, TP)
```

##### Method 3

Calculate batch-wise metrics and divide them by the number of test images at the end.

##### Method 4

Unlike all the above methods, which use 0.5 as the threshold on the predicted mask (after applying sigmoid), this method uses a range of thresholds and computes metrics on different thresholds for each metric and takes the mean of these values at the end.

Example:

```
for i, (x, y) in enumerate(zip(test_x, test_y)):
...
for t in range(len(threshold)):
thresh = threshold(t)
thresh_metric(t) = calc_metrics(mask1, pred1, thresh)
thresh_final(i,:) = thresh_metric
Score = np.sum(mean(thresh_final))/len(test_x)
```

I am confused about which way to use to report my modelâ€™s results on the test set.