Constant subtraction operation takes a lot of time

While using InsightFace, I found that the sample function sometimes takes a lot of time, which results in low GPU utilization. I used line_profiler to measure the time spent on each line and found that this line of code runs very slowly:

https://github.com/deepinsight/insightface/blob/master/recognition/partial_fc/pytorch/partial_classifier.py#L56

unreasonable result:

reasonable result (I split this line into two lines):

How can I solve this problem? Thanks.

total_label is distributed across multiple GPUs, and its shape is (batches * ngpus, embedding_size).
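One thing worth ruling out, offered as an assumption rather than a diagnosis: PyTorch launches CUDA kernels asynchronously, so a host-side tool such as line_profiler can charge a line for GPU work that earlier lines queued. If the slow line indexes a tensor with a boolean mask, that operation also has to wait for the GPU before it can continue, which would make it look expensive even when the subtraction itself is cheap. The sketch below is a generic timing helper that brackets the measured call with a synchronization hook. The name `timed` and its parameters are hypothetical and are not part of InsightFace or PyTorch.

```python
import time


def timed(fn, sync=None, repeats=10):
    """Return the mean wall-clock seconds per call of fn.

    sync is an optional zero-argument callable that blocks until all
    pending device work has finished. It is called once before the
    timer starts and once before the timer stops, so work queued by
    earlier code is not charged to fn.
    """
    if sync is not None:
        sync()  # drain work queued before the measurement
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if sync is not None:
        sync()  # wait for the work launched by fn itself
    return (time.perf_counter() - start) / repeats


# Hypothetical usage with PyTorch (not executed here; mask and
# class_start stand in for whatever the real line uses):
#   import torch
#   timed(lambda: total_label[mask] - class_start,
#         sync=torch.cuda.synchronize)
```

If the time reported this way is small, the cost seen in line_profiler probably belongs to work launched earlier, not to the subtraction.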