Position Embedding in Vision Transformers

I’m a bit confused about how the position embedding is applied to each patch in the transformer. Ideally, I thought we’d want each patch to carry a value like (1, 2, 3, 4, …) describing its position in the image, but the implementation here does something like this:

    # positional embedding
    self.pos_embed = nn.Parameter(
        torch.zeros(1, num_patches, embedding_dim)
    )
This is quite confusing, because now we have some sort of learned mapping instead of just a value appended to each patch. Also, isn’t there already an implicit position for each patch? Say the patch-embedding output has shape (1, 256, 768), corresponding to (batch, num_patches, embedding_dim). Since we have 256 patches, can’t the network understand that each patch sits at the position of its index? Why do we have to explicitly define a position embedding for each patch? Also, could you kindly explain the implementation above? I’m not sure I understand the mapping, or why it’s initialised to zero.
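To make sure I’m reading the code right, here is a minimal runnable sketch (PyTorch, with made-up sizes) of what I *think* is going on: the positional embedding is added element-wise to the patch embeddings, broadcast over the batch dimension, rather than appended as an extra value. Please correct me if this is wrong.

```python
import torch
import torch.nn as nn

batch, num_patches, embedding_dim = 2, 256, 768

# Learned positional embedding: one trainable vector per patch position.
# Initialised to zero here, as in the snippet above.
pos_embed = nn.Parameter(torch.zeros(1, num_patches, embedding_dim))

# Fake patch-embedding output with shape (batch, num_patches, embedding_dim).
x = torch.randn(batch, num_patches, embedding_dim)

# The position information is ADDED, not concatenated: the leading 1 in
# pos_embed's shape broadcasts over the batch dimension, so the shape of x
# is unchanged.
x = x + pos_embed

print(x.shape)  # torch.Size([2, 256, 768])
```

So, if I understand correctly, because pos_embed starts at zero it initially contributes nothing, and training is what turns each of its 256 rows into a vector encoding that patch’s position?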