Why does the aten::gt operation take more time to launch its kernel?

Here’s the problem: in the torch profiler timeline file, I found that comparison operations take more time to launch their CUDA kernels.
Here’s a picture comparing the kernel launch time of aten::gt with that of another operation:

Is this normal? If so, why?
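
In case it helps to reproduce this, here is a minimal sketch of the kind of script that produces such a timeline. It is not my exact code: the tensor size, the iteration count, the `profile_ops` helper name, and the use of `aten::add` as the "other operation" are all assumptions made for illustration. It includes a few warm-up iterations so that any one-time setup cost on the first call of each op is kept out of the profiled region, and it falls back to CPU-only profiling when no GPU is available.

```python
import torch
from torch.profiler import profile, ProfilerActivity


def profile_ops(n=1 << 20, iters=20):
    """Profile a comparison op (aten::gt) next to an arithmetic op (aten::add).

    `n` and `iters` are arbitrary illustrative values, not taken from the post.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    activities = [ProfilerActivity.CPU]
    if device == "cuda":
        activities.append(ProfilerActivity.CUDA)

    a = torch.randn(n, device=device)
    b = torch.randn(n, device=device)

    # Warm-up: run each op a few times before profiling so that one-time
    # setup cost on the first call does not show up in the measurements.
    for _ in range(3):
        _ = a > b   # dispatches to aten::gt
        _ = a + b   # dispatches to aten::add (assumed comparison baseline)
    if device == "cuda":
        torch.cuda.synchronize()

    with profile(activities=activities) as prof:
        for _ in range(iters):
            _ = a > b
            _ = a + b
        if device == "cuda":
            # Wait for queued kernels so they are captured inside the profile.
            torch.cuda.synchronize()
    return prof


prof = profile_ops()
op_names = {evt.key for evt in prof.key_averages()}

# Per-operator summary; compare the rows for aten::gt and aten::add.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

# Uncomment to write a timeline file that can be opened in a trace viewer.
# prof.export_chrome_trace("trace.json")
```

If the gap between aten::gt and the other op shrinks after the warm-up iterations, the extra time was probably first-call overhead rather than something specific to the comparison kernel, but that is only a guess on my side.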