PyTorch Benchmark docs issue

https://pytorch.org/tutorials/recipes/recipes/benchmark.html#comparing-benchmark-results

The link above shows a comparison of the time taken by two different implementations of the torch.dot method, mul/sum and bmm. The table shows that bmm takes longer when num_threads > 1, but right after the table the docs say: “The results above indicate that the version which reduces to bmm is better for larger tensors running on multiple threads, while for smaller and/or single thread code, the other version is better.”

However, bmm mostly outperforms mul/sum when running on larger tensors on a single thread. Did I miss anything here?
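
For anyone who wants to check this on their own machine, here is a minimal sketch of the comparison using torch.utils.benchmark. The two functions follow the recipe's mul/sum and bmm versions; the helper name run_comparison, the tensor shapes, the thread counts, and the number of timed runs are my own assumptions, chosen smaller than the tutorial's so it finishes quickly. Timings depend heavily on the machine, so I am not quoting any expected numbers.

```python
import torch
import torch.utils.benchmark as benchmark


def batched_dot_mul_sum(a, b):
    """Batched dot product via elementwise multiply, then sum over the last dim."""
    return a.mul(b).sum(-1)


def batched_dot_bmm(a, b):
    """Batched dot product by reshaping to (B, 1, N) x (B, N, 1) and calling bmm."""
    a = a.reshape(-1, 1, a.shape[-1])
    b = b.reshape(-1, a.shape[-1], 1)
    return torch.bmm(a, b).flatten(-3)


def run_comparison(shapes, thread_counts, number=20):
    """Hypothetical helper: time both versions for every shape and thread count."""
    results = []
    for batch, n in shapes:
        x = torch.ones((batch, n))
        for num_threads in thread_counts:
            for name, fn in (("mul/sum", batched_dot_mul_sum),
                             ("bmm", batched_dot_bmm)):
                timer = benchmark.Timer(
                    stmt="fn(x, x)",
                    globals={"fn": fn, "x": x},
                    num_threads=num_threads,
                    label="Batched dot",
                    sub_label=f"[{batch}, {n}]",
                    description=name,
                )
                # timeit runs the statement `number` times and returns a Measurement
                results.append(timer.timeit(number))
    return results


# Assumed shapes and thread counts; the tutorial sweeps a larger grid.
results = run_comparison(shapes=[(64, 64), (1024, 1024)], thread_counts=[1, 4])

# Prints a table with one row per shape and one column per implementation,
# grouped by thread count, similar to the one in the docs.
benchmark.Compare(results).print()
```

Reading the table row by row for each thread count should make it clear which version wins for which shape, and whether the sentence in the docs matches what the hardware actually does.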