PyTorch Benchmark docs issue

The link above shows a comparison of the time taken by two different implementations of a batched dot product: one reducing to mul/sum and one reducing to bmm. The table shows bmm taking longer when num_threads > 1, but after the table the docs state: “The results above indicate that the version which reduces to bmm is better for larger tensors running on multiple threads, while for smaller and/or single thread code, the other version is better.”
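For context, the two implementations being compared are batched dot products; a minimal sketch consistent with the tutorial (the function names follow the PyTorch benchmark recipe) is:

```python
import torch

def batched_dot_mul_sum(a, b):
    """Batched dot product via elementwise multiply and reduce: (B, N) -> (B,)."""
    return a.mul(b).sum(-1)

def batched_dot_bmm(a, b):
    """Batched dot product by reshaping to (B, 1, N) @ (B, N, 1) and using bmm."""
    a = a.reshape(-1, 1, a.shape[-1])
    b = b.reshape(-1, b.shape[-1], 1)
    return torch.bmm(a, b).flatten(-3)

# Both versions compute the same (B,)-shaped result.
x = torch.randn(64, 256, dtype=torch.float64)
same = torch.allclose(batched_dot_mul_sum(x, x), batched_dot_bmm(x, x))
```

The mul/sum version materializes the full elementwise product, while the bmm version hands the whole reduction to the batched-matmul kernel, which is where the threading behavior in the table comes from.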

However, bmm mostly outperforms mul/sum when running on larger tensors on a single thread. Did I miss anything here?
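For anyone who wants to check this on their own machine, here is a small reproduction sketch using torch.utils.benchmark.Timer with an explicit num_threads (the tensor size and thread counts below are my assumptions, not the tutorial's exact configuration):

```python
import torch
import torch.utils.benchmark as benchmark

# "Larger tensor" case; the exact size is an assumption for illustration.
x = torch.randn(10000, 64)

results = {}
for num_threads in (1, 4):
    # mul/sum version, timed at the given thread count.
    t_mul_sum = benchmark.Timer(
        stmt="a.mul(b).sum(-1)",
        globals={"a": x, "b": x},
        num_threads=num_threads,
    )
    # bmm version, timed at the same thread count.
    t_bmm = benchmark.Timer(
        stmt="torch.bmm(a.reshape(-1, 1, a.shape[-1]),"
             " b.reshape(-1, b.shape[-1], 1)).flatten(-3)",
        globals={"torch": torch, "a": x, "b": x},
        num_threads=num_threads,
    )
    results[num_threads] = (t_mul_sum.timeit(20), t_bmm.timeit(20))

for num_threads, (m0, m1) in results.items():
    print(f"threads={num_threads}  mul/sum: {m0.mean:.2e}s  bmm: {m1.mean:.2e}s")
```

Timer.timeit returns a Measurement whose .mean is the per-run time in seconds, so the printout lets you compare the two versions directly at each thread count; absolute numbers will of course vary by hardware.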