Why is torch.unique() so slow?

seg = nnunet.predict(data)
old_seg = seg
# print(old_seg.equal(seg)) # True
labels = torch.unique(seg)

I ran into a problem when running the code above:
With the third line left commented out, torch.unique takes about 100 ms. With the third line enabled, torch.unique takes only about 10 ms, but the equal check takes about 100 ms. What causes this, and how can I change my code to make it faster?
Here, seg is the segmentation result from nnunet inference: an int8 tensor on cuda:0.

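CUDA operations are executed asynchronously, and accessing the results of these operations synchronizes your code. The host timer therefore charges the still-pending inference work to whichever call first needs the result (equal or torch.unique in your snippet), so the 100 ms is most likely not the cost of torch.unique itself. Either synchronize manually before starting and stopping the host timers, or use torch.utils.benchmark, which synchronizes for you and adds warmup iterations.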
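Here is a minimal sketch of both timing approaches. The random `seg` tensor is a hypothetical stand-in for your nnunet output, and the `sync` and `timed_unique` helpers are names I made up for illustration. The script falls back to CPU when no GPU is available, in which case the synchronization is a no-op.

```python
import time

import torch
import torch.utils.benchmark as benchmark

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-in for the nnunet prediction: an int8 label volume.
seg = torch.randint(0, 4, (64, 64, 64), dtype=torch.int8, device=device)


def sync():
    """Block until all queued CUDA kernels have finished (no-op on CPU)."""
    if device == "cuda":
        torch.cuda.synchronize()


def timed_unique(t):
    """Time torch.unique alone, excluding work queued before the call."""
    sync()  # drain pending kernels (e.g. inference) before starting the clock
    start = time.perf_counter()
    labels = torch.unique(t)
    sync()  # make sure torch.unique itself has finished before stopping
    return labels, time.perf_counter() - start


# Option 1: manual synchronization around a host timer.
labels, elapsed = timed_unique(seg)
print(f"torch.unique: {elapsed * 1e3:.3f} ms, labels={labels.tolist()}")

# Option 2: torch.utils.benchmark handles synchronization and warmup.
timer = benchmark.Timer(
    stmt="torch.unique(seg)",
    globals={"torch": torch, "seg": seg},
)
measurement = timer.timeit(20)
print(measurement)
```

If torch.unique still turns out to be slow once it is timed in isolation, that is a separate question from the one here; the numbers in your post cannot tell the two cases apart.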