# Average precision (AP) - weird results if val set has easy examples; how to calculate?

I am computing average precision (AP) for object detection as in the Pascal VOC dataset. My results were too good, and I suspected I had overlooked something. But no: many AP implementations, including those in Mask R-CNN and the Pascal VOC devkit, share the issue that a single very easy example in the validation set renders the metric useless. This is how I calculate AP:

1. calculate precision and recall for each image, store the pairs in `R`
2. iterate through all `n` unique recall values `r` in `R` and, for each value `r`, find the max precision from `R` where `recall >= r` (the maximum precision where recall is greater than or equal to `r`) (source)
3. multiply each max precision value by its proportion, i.e., `next_recall - current_recall`. As recall values are between 0 and 1, these weights approximate the area under the curve.
4. sum all results from last step
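The steps above, as I understand them, could be sketched in Python like this (the function name is my own; step 1, producing the per-image pairs, is assumed done already):

```python
def interpolated_ap(recalls, precisions):
    """Steps 2-4: interpolated average precision from (recall, precision)
    pairs. Sketch of the scheme described above, not a reference
    implementation."""
    pairs = sorted(zip(recalls, precisions))          # sort by recall
    unique_recalls = sorted(set(r for r, _ in pairs))
    ap = 0.0
    prev_r = 0.0
    for r in unique_recalls:
        # step 2: max precision over all points with recall >= r
        p_max = max(p for rr, p in pairs if rr >= r)
        # step 3: weight by the recall span this threshold covers
        ap += (r - prev_r) * p_max
        prev_r = r
    return ap  # step 4: the accumulated sum
```

On a well-behaved curve (precision falling as recall rises) this gives sensible values; the failing example below is where it breaks down.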

Now this works fine and gives reasonable results if we assume that precision mostly drops as recall grows. But that is not always a realistic assumption. Given one easy image where precision = 1 and recall = 1, everything blows up: the AP value for the whole dataset will be 1 (the best possible result) even if the algorithm fails on all other examples. This is because each recall value gets assigned the maximum precision where recall is greater than or equal to `r`. A failing example:

```
R[recall]    = [0.7,  0.91, 1.0]
R[precision] = [0.11, 0.10, 1.0]
AP = 1.0
```
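Tracing the interpolation rule on this example shows why: the final point (recall = 1, precision = 1) dominates every threshold, so the interpolated precision envelope is 1.0 everywhere:

```python
recalls    = [0.7, 0.91, 1.0]
precisions = [0.11, 0.10, 1.0]

# For each recall threshold r, the interpolation rule picks
# max{precision[j] : recall[j] >= r}; the (1.0, 1.0) point is
# always in that set, so it wins at every threshold.
envelope = [max(p for rr, p in zip(recalls, precisions) if rr >= r)
            for r in recalls]
print(envelope)  # every entry is 1.0
```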

As I understand it, average precision should approximate the area under the precision-recall curve, which this clearly does not achieve. But as all implementations give me the same results, I wonder:

1. Am I missing something here?
2. Could we get a better, more accurate metric simply by redefining the maximum precision to be taken only where recall is between consecutive recall values `r_i` and `r_{i+1}`?
3. And finally, how do so many object detection articles use AP while most never point out how they calculated it?
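A sketch of the alternative from question 2 (my own interpretation: take the max precision only within each interval between consecutive recall values; this is a hypothetical variant, not a standard metric, and it assumes distinct recall values):

```python
def local_max_ap(recalls, precisions):
    """AP variant: for each recall interval (r_{i-1}, r_i], use the max
    precision attained *within* that interval only, rather than over all
    points with recall >= r_{i-1}. Hypothetical alternative for question 2.
    Assumes the recall values are distinct."""
    pairs = sorted(zip(recalls, precisions))
    rs = [0.0] + [r for r, _ in pairs]   # sentinel at recall 0
    ps = [0.0] + [p for _, p in pairs]
    ap = 0.0
    for i in range(1, len(rs)):
        # max precision among points whose recall lies in (rs[i-1], rs[i]]
        p_local = max(p for r, p in zip(rs, ps) if rs[i - 1] < r <= rs[i])
        ap += (rs[i] - rs[i - 1]) * p_local
    return ap
```

On the failing example this gives 0.7·0.11 + 0.21·0.10 + 0.09·1.0 = 0.188 instead of 1.0, so the single easy image no longer dominates.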

I have also tested this with the official Pascal VOC MATLAB code from their devkit:

```matlab
function ap = VOCap(rec, prec)
% Append sentinel values at both ends of the curve
mrec = [0 ; rec ; 1];
mpre = [0 ; prec ; 0];
% Right-to-left pass: make precision monotonically non-increasing
for i = numel(mpre)-1:-1:1
    mpre(i) = max(mpre(i), mpre(i+1));
end
% Indices where recall changes
i = find(mrec(2:end) ~= mrec(1:end-1)) + 1;
% Sum the areas of the rectangles under the interpolated curve
ap = sum((mrec(i) - mrec(i-1)) .* mpre(i));
```
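For reference, a direct pure-Python port of `VOCap` (my own translation) reproduces the AP = 1.0 result on the failing example above:

```python
def voc_ap(rec, prec):
    """Pure-Python port of the VOCap MATLAB function above."""
    # Append sentinel values at both ends of the curve
    mrec = [0.0] + list(rec) + [1.0]
    mpre = [0.0] + list(prec) + [0.0]
    # Right-to-left pass: make precision monotonically non-increasing
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Accumulate rectangle areas wherever recall changes
    ap = 0.0
    for i in range(1, len(mrec)):
        if mrec[i] != mrec[i - 1]:
            ap += (mrec[i] - mrec[i - 1]) * mpre[i]
    return ap
```

The right-to-left max pass is exactly where the single easy point spreads its precision of 1.0 over the whole recall range.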

There is also the PyTorch TNT average precision metric, yet another variant. It looks like it defines AP for a single validation example rather than for the dataset, as its inputs are `output` and `target` (which makes it hard to use for object detection, where you have to calculate IoU and cannot use the direct model output).

Or perhaps someone can just explain how to calculate average precision over the whole dataset in object detection? Do you just take the mean of all precision values?

@martinr That is a very sharp observation. I have been grappling with a similar problem. Did you find a solution? How did you resolve this, please?