After some further debugging, I was able to pinpoint why this happens.

`torchvision.io.image.read_image` produces a tensor which is the CHW-permuted version of an HWC tensor. That means a CHW tensor of size `[3, 1, 2]` has strides `[1, 6, 3]`.

```
import torch

def print_tensor_info(x: torch.Tensor) -> None:
    print('size:', x.size())
    print('stride:', x.stride())

orig = torch.tensor([
    [
        [1, 2, 3],
        [4, 5, 6],
    ],
])
print_tensor_info(orig)
# size: torch.Size([1, 2, 3])
# stride: (6, 3, 1)

permuted = orig.permute(2, 0, 1)
print_tensor_info(permuted)
# size: torch.Size([3, 1, 2])
# stride: (1, 6, 3)
```

Calling `unsqueeze` on this `permuted` CHW tensor to get an NCHW tensor sets an *invalid* batch stride of `3`:

```
permuted_unsqueezed = permuted.unsqueeze(dim=0)
print_tensor_info(permuted_unsqueezed)
# size: torch.Size([1, 3, 1, 2])
# stride: (3, 1, 6, 3)
```

The correct value would be `6`. The stride is chosen by a simple heuristic in https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/TensorShape.cpp#L3191-L3199:

```
InferUnsqueezeGeometryResult
inferUnsqueezeGeometry(const Tensor& tensor, int64_t dim) {
  InferUnsqueezeGeometryResult result(tensor.sizes(), tensor.strides());
  int64_t new_stride = dim >= tensor.dim() ? 1 : result.sizes[dim] * result.strides[dim];
  result.sizes.insert(result.sizes.begin() + dim, 1);
  result.strides.insert(result.strides.begin() + dim, new_stride);
  return result;
}
```
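
To make the failure mode concrete, here is a rough Python transliteration of that heuristic (the function name is mine, not PyTorch API) applied to the strides of `permuted`:

```
def infer_unsqueeze_stride(sizes, strides, dim):
    # mirrors: dim >= tensor.dim() ? 1 : result.sizes[dim] * result.strides[dim]
    return 1 if dim >= len(sizes) else sizes[dim] * strides[dim]

# permuted: size [3, 1, 2], stride (1, 6, 3)
print(infer_unsqueeze_stride([3, 1, 2], [1, 6, 3], dim=0))
# 3 -- but a channels-last-consistent batch stride would be 6
```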

This `permuted_unsqueezed` tensor with sizes `[1, 3, 1, 2]` and strides `[3, 1, 6, 3]` is treated as having a `Contiguous` memory format (and not a `ChannelsLast` memory format): https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/core/TensorBase.h#L270-L289.
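
For reference, here is a simplified Python sketch of the channels-last stride heuristic behind that memory-format decision (loosely following `is_channels_last_strides_2d` in `c10/core/MemoryFormat.h`; the function name and comments are mine):

```
def strides_like_channels_last_2d(sizes, strides) -> bool:
    # simplified sketch of the NHWC stride heuristic; the real logic
    # lives in c10/core/MemoryFormat.h (is_channels_last_strides_2d)
    if strides[1] == 0:
        return False
    min_stride = 0
    for d in (1, 3, 2, 0):  # walk dims in NHWC order: C, W, H, N
        if sizes[d] == 0:
            return False
        if strides[d] < min_stride:
            return False
        if d == 0 and min_stride == strides[1]:
            return False  # ambiguous N111-like case falls back to NCHW
        min_stride = strides[d]
        if sizes[d] > 1:
            min_stride *= sizes[d]
    return True

print(strides_like_channels_last_2d([1, 3, 1, 2], [3, 1, 6, 3]))  # False -> Contiguous
print(strides_like_channels_last_2d([1, 3, 1, 2], [6, 1, 6, 3]))  # True  -> ChannelsLast
```

The batch stride of `3` fails the check at `d == 0` (it is smaller than the running minimum of `6`), so the tensor falls back to `Contiguous`.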

Adding a `zeros_like` tensor yields the following tensor:

```
zeros_like = torch.zeros_like(permuted_unsqueezed)
permuted_unsqueezed_added_zeros = permuted_unsqueezed.add(zeros_like)
print_tensor_info(permuted_unsqueezed_added_zeros)
# size: torch.Size([1, 3, 1, 2])
# stride: (6, 1, 6, 3)
```

Note that the strides `[6, 1, 6, 3]` are correct this time: the out-of-place `add` allocates a fresh output whose strides TensorIterator recomputes from the inputs' dimension order. Therefore the tensor is handled as having `ChannelsLast` memory format.

When both tensors are put through the same convolution layer, they produce slightly different results, since there are separate code paths for `Contiguous` and `ChannelsLast` tensors: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/ConvolutionMM2d.cpp#L247-L283:

```
if (is_channels_last) {
  // ...
  at::native::cpublas::gemm(...);
} else {
  // ...
  at::native::cpublas::gemm(...);
}
```
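
A minimal sketch of how the discrepancy can be observed (the layer and seed are arbitrary, and depending on the backend the call may not hit exactly the code path linked above):

```
torch.manual_seed(0)
conv = torch.nn.Conv2d(in_channels=3, out_channels=4, kernel_size=1)

# identical values, but the strides suggest different memory formats
a = conv(permuted_unsqueezed.float())              # strides (3, 1, 6, 3) -> Contiguous
b = conv(permuted_unsqueezed_added_zeros.float())  # strides (6, 1, 6, 3) -> ChannelsLast
print((a - b).abs().max())  # tiny nonzero value if the two paths diverge
```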

The difference is on the order of `1e-8`, so in practice it should not be a problem.

This also seems to be a known issue: https://github.com/pytorch/pytorch/issues/68430#issuecomment-970895522