Performance of contiguous vs. non-contiguous tensors

Hi all, I work mostly on computer vision problems, and I’ve found that CV-related code usually involves a lot of tensor manipulation (e.g., reshaping, swapping axes, adding new axes), which can produce non-contiguous tensors (a very good explanation here). Sometimes people deliberately keep tensors contiguous; for example, the following line from the popular detectron2’s detectron2.data.dataset_mapper.DatasetMapper:

dataset_dict["image"] = torch.as_tensor(np.ascontiguousarray(image.transpose(2, 0, 1)))

makes sure that the image tensor is created with contiguous memory.
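For context, here is a minimal NumPy sketch of what that line guards against (the image shape is just illustrative):

```python
import numpy as np

# A hypothetical HWC image, as most image-loading libraries return it.
image = np.zeros((480, 640, 3), dtype=np.uint8)

# transpose to CHW only swaps strides; no data is moved, and the
# result is no longer C-contiguous.
chw = image.transpose(2, 0, 1)
print(chw.flags["C_CONTIGUOUS"])                      # False

# np.ascontiguousarray copies the data into a fresh contiguous buffer,
# so the torch.as_tensor call above gets a contiguous tensor.
print(np.ascontiguousarray(chw).flags["C_CONTIGUOUS"])  # True
```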

But most of the time people don’t care much about keeping tensors contiguous, flipping them around freely (not blaming anyone, just describing how people make full use of the flexibility the framework provides, lol).

I wonder if there are any general guidelines for dealing with tensor memory? Is it always better to use contiguous tensors? Thanks!

The thing is, most operations on permuted tensors create contiguous outputs, so permuted tensors usually don’t stick around. One notable exception is the torch.channels_last memory format, added some time ago; that one is “sticky”.

Could you please give an example of how new contiguous tensors are created? I’ve tried timing forward passes with contiguous vs. non-contiguous inputs, and the contiguous tensors have lower latency.

x=torch.ones(2,3,4).permute(0,2,1)
(x * 2).stride()
Out[54]: (12, 1, 4)
(x.log()).stride()
Out[55]: (12, 1, 4)

So, elementwise ops don’t create contiguous tensors, but I believe they run at the same speed on non-contiguous tensors (at least unary ops and ops with a scalar operand).
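A small self-contained check of that claim about layout (speed aside): an elementwise op preserves the permuted strides, and the values match what you’d get on a contiguous copy.

```python
import torch

# Non-contiguous view: permute only rearranges strides.
x = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4).permute(0, 2, 1)
y = x * 2  # elementwise op: the output keeps the permuted layout

print(y.stride() == x.stride())            # True: layout is preserved
print(y.is_contiguous())                   # False
print(torch.equal(y, x.contiguous() * 2))  # True: same values either way
```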

(x.matmul(torch.ones(3,5))).stride()
Out[56]: (20, 5, 1)
F.conv1d(x, torch.ones(5,4,1)).stride()
Out[58]: (15, 3, 1)

Contiguous tensors are created by the usual “layer” ops. Yes, those can run slower on non-contiguous inputs.
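If you want to measure that on your own machine, a crude wall-clock sketch is below (results vary by hardware and BLAS backend, so no numbers are claimed here; torch.utils.benchmark is the more careful tool):

```python
import time
import torch

def bench(fn, iters=100):
    # Crude wall-clock timing; prefer torch.utils.benchmark for real work.
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return time.perf_counter() - t0

a = torch.randn(512, 512)
b = a.t()                      # non-contiguous view of the same data
w = torch.randn(512, 512)

print(f"contiguous:     {bench(lambda: a.matmul(w)):.4f}s")
print(f"non-contiguous: {bench(lambda: b.matmul(w)):.4f}s")

# Either way, matmul materialises a fresh contiguous output.
print(a.matmul(w).is_contiguous(), b.matmul(w).is_contiguous())
```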

However:

nhwc = torch.ones(2,3,4,5,device="cuda")
nchw = nhwc.permute(0,3,1,2)
(nchw.stride(), nchw.is_contiguous(), nchw.is_contiguous(memory_format=torch.channels_last))
Out[69]: ((60, 1, 20, 5), False, True)
F.conv2d(nchw, torch.ones(6,5,1,1,device="cuda")).stride()
Out[72]: (72, 1, 24, 6)

Here the torch.channels_last format is auto-detected for a 4D CUDA tensor, and is preserved through the convolution.
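You don’t have to rely on auto-detection either; you can opt in explicitly with .to(memory_format=...). A minimal CPU sketch (the same call works for CUDA tensors):

```python
import torch

x = torch.randn(2, 5, 3, 4)                     # default (NCHW-style) layout
x_cl = x.to(memory_format=torch.channels_last)  # same shape, NHWC strides

print(x_cl.stride())                                          # (60, 1, 20, 5)
print(x_cl.is_contiguous(memory_format=torch.channels_last))  # True
print(x_cl.is_contiguous())      # False in the default memory format
```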