Here's the problem: in a torch profiler timeline file, I found that comparison operations take more time to launch their CUDA kernel than other operations do. Here's a picture comparing the kernel launch time of aten::gt with that of another operation:
Is this normal? If so, why?
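
For anyone who wants to reproduce this, here is a minimal sketch of how such a comparison might be profiled. The tensor size, the iteration count, the helper name `profile_ops`, and the use of `aten::add` as the reference operation are all assumptions for illustration and are not taken from the original model. It falls back to CPU when CUDA is unavailable.

```python
import torch
from torch.profiler import profile, ProfilerActivity


def profile_ops(n=1 << 20, iters=20):
    """Profile torch.gt against torch.add (hypothetical helper, illustrative sizes)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    activities = [ProfilerActivity.CPU]
    if device == "cuda":
        activities.append(ProfilerActivity.CUDA)

    x = torch.randn(n, device=device)

    # Warm up outside the profiled region so one-time setup costs
    # are not attributed to the first profiled call.
    for _ in range(3):
        torch.gt(x, 0.0)
        torch.add(x, 1.0)
    if device == "cuda":
        torch.cuda.synchronize()

    with profile(activities=activities) as prof:
        for _ in range(iters):
            torch.gt(x, 0.0)   # comparison op, recorded as aten::gt
            torch.add(x, 1.0)  # reference op (an assumption), recorded as aten::add
        if device == "cuda":
            torch.cuda.synchronize()
    return prof


prof = profile_ops()
stats = {e.key: e for e in prof.key_averages()}
for name in ("aten::gt", "aten::add"):
    e = stats[name]
    # Average host-side time per call, in microseconds.
    print(name, "calls:", e.count, "avg CPU us:", e.cpu_time_total / e.count)
```

Calling `prof.export_chrome_trace("trace.json")` afterwards writes a timeline file that can be opened in a trace viewer to inspect the launch of each kernel.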