Tracing: Confusion about the duration of similar `flip` operations

I was analyzing a trace from one of my experiments. At a given point I have two consecutive `aten::flip` operations with the following input dimensions:

  1. [[128, 128, 64, 6], []]
  2. [[128, 128, 64, 13, 6], []]

This is the code associated with them:

    Y = torch.nn.functional.pad(Y, padding)
    if Y.ndim == 5:
      Y[:, :, :,  0, m:] = _Y[:, :, :, 0, 1:1+p].flip([-1])
      Y[:, :, :, 1:, m:] = _Y[:, :, :, 1:, 1:1+p].flip([-2, -1])
    elif Y.ndim == 4:
      Y[:, :,  0, m:] = _Y[ :, :, 0, 1:1+p].flip([-1])
      Y[:, :, 1:, m:] = _Y[ :, :, 1:, 1:1+p].flip([-2, -1])
    else:
      raise ValueError('Wrong Shape for Y')
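
For reference, `flip` simply reverses a tensor along the given dimensions; a tiny illustration (shapes chosen only for demonstration, not taken from the experiment):

    import torch

    x = torch.arange(6).reshape(2, 3)
    print(x.flip([-1]))      # reverses the last dimension: [[2, 1, 0], [5, 4, 3]]
    print(x.flip([-2, -1]))  # reverses the last two dimensions: [[5, 4, 3], [2, 1, 0]]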

For some reason the first takes more time to execute than the second. This is counter-intuitive to me, given that the first has fewer elements. Why would this be?

Here are some screenshots from the trace.


I am pretty new to tracing and understanding all the dynamics of GPU kernel dispatches and optimization. Is there anything I am missing that would make things clearer here?

I cannot reproduce the results using this benchmark code:

import torch
import torch.utils.benchmark


def fun1(Y, p=3, m=-3):
    Y[:, :, :,  0, m:] = Y[:, :, :, 0, 1:1+p].flip([-1])
    Y[:, :, :, 1:, m:] = Y[:, :, :, 1:, 1:1+p].flip([-2, -1])
    return Y


def fun2(Y, p=3, m=-3):
    Y[:, :,  0, m:] = Y[ :, :, 0, 1:1+p].flip([-1])
    Y[:, :, 1:, m:] = Y[ :, :, 1:, 1:1+p].flip([-2, -1])
    return Y

Y1 = torch.randn([128, 128, 64, 13, 6], device='cuda')
Y2 = torch.randn([128, 128, 64, 6], device='cuda')

t1 = torch.utils.benchmark.Timer(stmt="fun1(Y1)", globals=globals())
t2 = torch.utils.benchmark.Timer(stmt="fun2(Y2)", globals=globals())

t1.blocked_autorange()
> <torch.utils.benchmark.utils.common.Measurement object at 0x7fe04040ad30>
  fun1(Y1)
    2.28 ms
    1 measurement, 100 runs , 1 thread
t2.blocked_autorange()
> <torch.utils.benchmark.utils.common.Measurement object at 0x7fe04040a070>
  fun2(Y2)
    Median: 226.38 us
    IQR:    6.76 us (224.14 to 230.90)
    8 measurements, 100 runs per measurement, 1 thread

Could you check if both tensors are contiguous before they are passed to the functions?
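
For example, a quick check along these lines right before the flip/assignment code (a minimal sketch; `Y` and `_Y` are the names from your snippet):

    # Quick sanity check of the memory layout before the flip/assignment code
    print(Y.is_contiguous(), Y.stride())
    print(_Y.is_contiguous(), _Y.stride())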

Hello @ptrblck, I really appreciate you taking a look at this. Thank you so much.

So, here is a code snippet for reproducing it.

import torch

def my_not_so_secrete_function(Y, _Y):
  p = Y.shape[-2] - Y.shape[-1]
  padding = (0, p)
  m = Y.shape[-1]

  # Pad the last dimension by p, then fill the padded region with flipped slices of _Y
  Y = torch.nn.functional.pad(Y, padding)
  if Y.ndim == 5:
    Y[:, :, :,  0, m:] = _Y[:, :, :, 0, 1:1+p].flip([-1])
    Y[:, :, :, 1:, m:] = _Y[:, :, :, 1:, 1:1+p].flip([-2, -1])
  elif Y.ndim == 4:
    Y[:, :,  0, m:] = _Y[ :, :, 0, 1:1+p].flip([-1])
    Y[:, :, 1:, m:] = _Y[ :, :, 1:, 1:1+p].flip([-2, -1])
  else:
    raise ValueError('Wrong Shape for Y')

  return Y

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
with torch.profiler.profile(
   record_shapes=True
) as profile:
  for _ in range(8):
    Y = torch.randint(32, size=(128, 64, 3, 32, 17), dtype=torch.float, device=device)
    _Y = torch.randint(32, size=(128, 64, 3, 32, 17), dtype=torch.float, device=device)
    output = my_not_so_secrete_function(Y, _Y)

profile.export_chrome_trace('profile-investigation.json')

I uploaded the profile-investigation.json file here as a GitHub Gist.
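
In case it helps, the recorded events can also be summarized directly in Python, in addition to the exported Chrome trace; a small sketch using the `profile` object from the snippet above, grouping by the recorded input shapes so the two `aten::flip` variants show up as separate rows:

    # Aggregate the profiled events per (op, input shape) and sort by GPU time
    print(profile.key_averages(group_by_input_shape=True).table(
        sort_by="self_cuda_time_total", row_limit=20))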