After some further debugging I was able to pinpoint why this happens. `torchvision.io.image.read_image` produces a tensor which is the CHW-permuted version of an HWC tensor. That means that for a `[3, 1, 2]`-sized CHW tensor the strides are `[1, 6, 3]`:
```python
import torch


def print_tensor_info(x: torch.Tensor) -> None:
    print('size:', x.size())
    print('stride:', x.stride())


# HWC tensor with sizes [1, 2, 3]
orig = torch.tensor([
    [
        [1, 2, 3],
        [4, 5, 6],
    ],
])
print_tensor_info(orig)
# size: torch.Size([1, 2, 3])
# stride: (6, 3, 1)

# CHW view with sizes [3, 1, 2]
permuted = orig.permute(2, 0, 1)
print_tensor_info(permuted)
# size: torch.Size([3, 1, 2])
# stride: (1, 6, 3)
```
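One way to check this on an actual `read_image` output; a minimal sketch, assuming an RGB image at a hypothetical path `example.png`:

```python
from torchvision.io.image import read_image

# Hypothetical file; any RGB image works. For an H x W image the decoded
# buffer is laid out HWC in memory, so the CHW tensor reported here should
# show permuted strides rather than CHW-contiguous ones.
img = read_image('example.png')
print(img.size())    # torch.Size([3, H, W])
print(img.stride())  # expected to be (1, 3 * W, 3) per the behaviour described above
print(img.is_contiguous())
```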
Calling `unsqueeze` on this `permuted` CHW tensor to get an NCHW tensor sets an invalid batch stride of `3`:
```python
permuted_unsqueezed = permuted.unsqueeze(dim=0)
print_tensor_info(permuted_unsqueezed)
# size: torch.Size([1, 3, 1, 2])
# stride: (3, 1, 6, 3)
```
The correct value would be `6`. The new stride is chosen by a simple heuristic in https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/TensorShape.cpp#L3191-L3199:
```cpp
InferUnsqueezeGeometryResult
inferUnsqueezeGeometry(const Tensor& tensor, int64_t dim) {
  InferUnsqueezeGeometryResult result(tensor.sizes(), tensor.strides());
  int64_t new_stride = dim >= tensor.dim() ? 1 : result.sizes[dim] * result.strides[dim];
  result.sizes.insert(result.sizes.begin() + dim, 1);
  result.strides.insert(result.strides.begin() + dim, new_stride);
  return result;
}
```
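Applied to the `permuted` tensor above (sizes `[3, 1, 2]`, strides `[1, 6, 3]`) with `dim=0`, the heuristic picks `sizes[0] * strides[0] = 3 * 1 = 3` instead of the total element count `6`. A small Python sketch of the same logic, just for illustration:

```python
def unsqueeze_geometry(sizes: list[int], strides: list[int], dim: int):
    # Mirrors inferUnsqueezeGeometry: the new stride is derived only from
    # the size/stride at `dim`, which ignores how the other dims are laid out.
    new_stride = 1 if dim >= len(sizes) else sizes[dim] * strides[dim]
    sizes = sizes[:dim] + [1] + sizes[dim:]
    strides = strides[:dim] + [new_stride] + strides[dim:]
    return sizes, strides

print(unsqueeze_geometry([3, 1, 2], [1, 6, 3], dim=0))
# ([1, 3, 1, 2], [3, 1, 6, 3]) -- batch stride 3, not the expected 6
```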
This `permuted_unsqueezed` tensor with sizes `[1, 3, 1, 2]` and strides `[3, 1, 6, 3]` is treated as having a `Contiguous` memory format (and not a `ChannelsLast` memory format): https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/core/TensorBase.h#L270-L289.
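Here is a rough Python sketch of the stride heuristic PyTorch uses to decide whether NCHW strides "look like" `ChannelsLast` (simplified from the C++ check behind `suggest_memory_format`; the function name is mine, not a PyTorch API):

```python
def strides_look_channels_last(sizes, strides):
    # Simplified sketch: walk the dims in NHWC order (C, W, H, N) and require
    # non-decreasing strides, falling back to Contiguous for ambiguous layouts.
    if strides[1] == 0:
        return False
    required = 0
    for d in (1, 3, 2, 0):
        if sizes[d] == 0:
            return False
        if strides[d] < required:
            return False
        if d == 0 and required == strides[1]:
            return False
        required = strides[d]
        if sizes[d] > 1:
            required *= sizes[d]
    return True

print(strides_look_channels_last([1, 3, 1, 2], [3, 1, 6, 3]))  # False -> Contiguous
```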
Adding a `zeros_like` tensor yields the following tensor:
```python
zeros_like = torch.zeros_like(permuted_unsqueezed)
permuted_unsqueezed_added_zeros = permuted_unsqueezed.add(zeros_like)
print_tensor_info(permuted_unsqueezed_added_zeros)
# size: torch.Size([1, 3, 1, 2])
# stride: (6, 1, 6, 3)
```
Note that the stride `[6, 1, 6, 3]` is correct this time, so this tensor is handled as having the `ChannelsLast` memory format.
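Running the same `strides_look_channels_last` sketch from above on these strides returns `True`, matching the `ChannelsLast` classification:

```python
print(strides_look_channels_last([1, 3, 1, 2], [6, 1, 6, 3]))  # True -> ChannelsLast
```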
When both tensors are put through the same convolution layer, they produce slightly different results, since there are separate code paths for `Contiguous` and `ChannelsLast` tensors: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/ConvolutionMM2d.cpp#L247-L283:
```cpp
if (is_channels_last) {
  // ...
  at::native::cpublas::gemm(...);
} else {
  // ...
  at::native::cpublas::gemm(...);
}
```
The difference is on the order of 1e-8, so in practice it should not be a problem. This also seems to be a known issue: https://github.com/pytorch/pytorch/issues/68430#issuecomment-970895522