# Tracing: Confusion with duration of similar `flip` operations

I was analyzing a tracing from one of my experiments. At a given point I have two consecutive `aten::flip` operations with the respective input dimensions:

1. `[[128, 128, 64, 6], []]`
2. `[[128, 128, 64, 13, 6], []]`

This is the code associated with it:

``````python
    Y = torch.nn.functional.pad(Y, padding)
    if Y.ndim == 5:
        Y[:, :, :,  0, m:] = _Y[:, :, :, 0, 1:1+p].flip([-1])
        Y[:, :, :, 1:, m:] = _Y[:, :, :, 1:, 1:1+p].flip([-2, -1])
    elif Y.ndim == 4:
        Y[:, :,  0, m:] = _Y[:, :, 0, 1:1+p].flip([-1])
        Y[:, :, 1:, m:] = _Y[:, :, 1:, 1:1+p].flip([-2, -1])
    else:
        raise ValueError('Wrong Shape for Y')
``````

For some reason the first takes more time to execute than the second. This is counter-intuitive to me, given that the first has fewer elements. Why would this be?

Here are some screenshots from the trace.

I am pretty new to tracing and to the dynamics of GPU kernel dispatch and optimization. Is there anything I am missing that would make this clearer?
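For reference, this is the kind of standalone timing I tried in order to compare the two flips in isolation (a sketch, not my actual trace setup): it adds warm-up iterations and synchronizes around each measurement so one-time launch overhead is not conflated with kernel time, and it falls back to CPU when no GPU is available.

```python
import time
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

a = torch.randn(128, 128, 64, 6, device=device)      # 4D input of the first flip
b = torch.randn(128, 128, 64, 13, 6, device=device)  # 5D input of the second flip

def timed_flip(x, dims, warmup=3, iters=10):
    # Warm-up runs so one-time kernel launch/caching costs are excluded.
    for _ in range(warmup):
        x.flip(dims)
    if x.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        x.flip(dims)
    if x.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

t4 = timed_flip(a, [-1])
t5 = timed_flip(b, [-2, -1])
print(f'4D flip: {t4 * 1e6:.1f} us, 5D flip: {t5 * 1e6:.1f} us')
```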

I cannot reproduce the results using this benchmark code:

``````python
import torch
import torch.utils.benchmark

def fun1(Y, p=3, m=-3):
    Y[:, :, :,  0, m:] = Y[:, :, :, 0, 1:1+p].flip([-1])
    Y[:, :, :, 1:, m:] = Y[:, :, :, 1:, 1:1+p].flip([-2, -1])
    return Y

def fun2(Y, p=3, m=-3):
    Y[:, :,  0, m:] = Y[:, :, 0, 1:1+p].flip([-1])
    Y[:, :, 1:, m:] = Y[:, :, 1:, 1:1+p].flip([-2, -1])
    return Y

Y1 = torch.randn([128, 128, 64, 13, 6], device='cuda')
Y2 = torch.randn([128, 128, 64, 6], device='cuda')

t1 = torch.utils.benchmark.Timer(stmt="fun1(Y1)", globals=globals())
t2 = torch.utils.benchmark.Timer(stmt="fun2(Y2)", globals=globals())

t1.blocked_autorange()
> <torch.utils.benchmark.utils.common.Measurement object at 0x7fe04040ad30>
fun1(Y1)
  2.28 ms
  1 measurement, 100 runs, 1 thread
t2.blocked_autorange()
> <torch.utils.benchmark.utils.common.Measurement object at 0x7fe04040a070>
fun2(Y2)
  Median: 226.38 us
  IQR:    6.76 us (224.14 to 230.90)
  8 measurements, 100 runs per measurement, 1 thread
``````

Could you check if both tensors are contiguous before being passed to the functions?
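Something like this quick sketch would show it; sliced views like the ones in your indexing code are typically non-contiguous, which changes the memory-access pattern the kernels see:

```python
import torch

Y = torch.randn(128, 128, 64, 6)
print(Y.is_contiguous())      # freshly allocated tensors are contiguous

# A sliced view keeps the parent's strides, so it is usually not contiguous:
view = Y[:, :, 1:, 1:4]
print(view.is_contiguous())
print(view.stride())          # the strides reveal the view's memory layout
```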

Hello @ptrblck, I really appreciate you taking a look at this. Thank you so much.

So, here is a code snippet for reproducing it.

``````python
import torch

def my_not_so_secrete_function(Y, _Y):
    p = Y.shape[-2] - Y.shape[-1]
    padding = (0, p)
    m = Y.shape[-1]

    if Y.ndim == 5:
        Y[:, :, :,  0, m:] = _Y[:, :, :, 0, 1:1+p].flip([-1])
        Y[:, :, :, 1:, m:] = _Y[:, :, :, 1:, 1:1+p].flip([-2, -1])
    elif Y.ndim == 4:
        Y[:, :,  0, m:] = _Y[:, :, 0, 1:1+p].flip([-1])
        Y[:, :, 1:, m:] = _Y[:, :, 1:, 1:1+p].flip([-2, -1])
    else:
        raise ValueError('Wrong Shape for Y')

    return Y

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
with torch.profiler.profile(
    record_shapes=True
) as profile:
    for _ in range(8):
        Y = torch.randint(32, size=(128, 64, 3, 32, 17), dtype=torch.float, device=device)
        _Y = torch.randint(32, size=(128, 64, 3, 32, 17), dtype=torch.float, device=device)
        output = my_not_so_secrete_function(Y, _Y)

profile.export_chrome_trace('profile-investigation.json')
``````
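As a side note, when I only want aggregate per-op numbers I also print the profiler's summary table rather than opening the Chrome trace; here is a minimal CPU-only sketch of that (not my actual setup), grouping by input shape so the two flip variants show up as separate rows:

```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(64, 13, 6)
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    for _ in range(8):
        x.flip([-2, -1])

# Aggregate per-op stats, grouped by input shape, so the flip calls
# can be compared directly in a text table.
print(prof.key_averages(group_by_input_shape=True)
          .table(sort_by='cpu_time_total', row_limit=5))
```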

I uploaded the `profile-investigation.json` file here as a GitHub Gist.