Processing unfolded tensors 2x slower?

I’m using the unfold operation on a temporal sequence of data to generate a batch of sliding-window cases for a model. I send the original temporal sequence to the GPU, and I found that with careful attention to indices I could keep everything as “pure views” of the data, so there was very little (no?) extra GPU memory overhead. Yay!
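
For reference, here’s a small toy version of what I mean by a “pure view” (the variable names are just for illustration, not from the benchmark below): unfold returns a view that shares storage with the original tensor, so no new memory is allocated.

import torch

x = torch.arange(12.0).reshape(6, 2)              # toy temporal sequence: 6 steps, width 2
windows = x.unfold(dimension=0, size=3, step=1)   # 4 sliding windows, shape (4, 2, 3)

print(windows._base is x)                  # True: windows is a view of x
print(windows.data_ptr() == x.data_ptr())  # True: same underlying storage, no copy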

But then I happened to violate the pure views and triggered a copy of the unfolded tensor (which you can also force by calling clone() on the view)… and my model ran 2x faster! And GPU memory usage was huge, of course!
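
Continuing the toy example above, cloning the view is what materializes the separate buffer:

windows_clone = windows.clone()                  # copies the windows into fresh memory
print(windows_clone.data_ptr() == x.data_ptr())  # False: the clone has its own storage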

See the code below for a simple benchmark.

I don’t have a good hypothesis for why processing the view is so much slower. I thought maybe the indexing (stride) information was stored on the CPU, which would make it slow for the GPU to fetch… but I see the same 2x slowdown when running entirely on the CPU.

On GPU
clone: 3.34 ms ± 15.3 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
view:  6.43 ms ± 316 μs per loop (mean ± std. dev. of 7 runs, 100 loops each) ← view is 2x slower!

On CPU
clone: 223 ms ± 29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
view:  410 ms ± 35.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) ← view is 2x slower!

Benchmark code:

import torch

data_length = 1_000_000
data_width = 10
sequence_length = 50
#device = 'cuda'
device = 'cpu'

print(f"Initial CUDA memory allocated: {torch.cuda.memory_allocated(0)}")
d = torch.rand((data_length, data_width), device=device)
print(f"After d:                       {torch.cuda.memory_allocated(0)}")
d_unfold = d.unfold(dimension=0, size=sequence_length, step=1).mT             # <-- note the mT
print(f"After d_unfold:                {torch.cuda.memory_allocated(0)}")
# Without the matrix transpose above, flatten() has to copy instead of returning a view
d_unfold_flatten = d_unfold.flatten(start_dim=1)
print(f"After d_unfold_flatten:        {torch.cuda.memory_allocated(0)}")
d_unfold_flatten_clone = d_unfold_flatten.clone()
print(f"After d_unfold_flatten_clone:  {torch.cuda.memory_allocated(0)}")

print(f"{d.shape=}")
print(f"{d_unfold.shape=}")
print(f"{d_unfold_flatten.shape=}")
print(f"{d_unfold_flatten_clone.shape=}")

model = torch.nn.Linear(in_features=data_width * sequence_length, out_features=16, device=device)

print(f"{device=}")
%timeit y = model(d_unfold_flatten_clone)  # contiguous copy
%timeit y = model(d_unfold_flatten)        # pure view
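
In case it’s relevant, the flattened view ends up with very different strides from the clone, since the sliding windows overlap in d’s storage (expected values noted in the comments):

print(f"{d_unfold_flatten.stride()=}")              # (10, 1): windows overlap in d's storage
print(f"{d_unfold_flatten_clone.stride()=}")        # (500, 1): plain contiguous layout
print(f"{d_unfold_flatten.is_contiguous()=}")       # False
print(f"{d_unfold_flatten_clone.is_contiguous()=}") # True

I don’t know whether the non-contiguous layout alone explains the 2x, but it’s the most obvious difference between the two tensors.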