I am trying to implement the mask_windows computation (window partitioning) in the Swin Transformer architecture.
I have a mask tensor that looks like this:
tensor([[0., 0., 0., 0., 1., 1., 2., 2.],
[0., 0., 0., 0., 1., 1., 2., 2.],
[0., 0., 0., 0., 1., 1., 2., 2.],
[0., 0., 0., 0., 1., 1., 2., 2.],
[3., 3., 3., 3., 4., 4., 5., 5.],
[3., 3., 3., 3., 4., 4., 5., 5.],
[6., 6., 6., 6., 7., 7., 8., 8.],
[6., 6., 6., 6., 7., 7., 8., 8.]])
Its actual shape is torch.Size([1, 8, 8, 1]); the printout above omits the batch and channel dimensions.
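For reproducibility, this mask can be built with the shifted-window region numbering used in the reference Swin implementation (window_size = 4, shift_size = 2 here); roughly:

import torch

H, W = 8, 8
window_size, shift_size = 4, 2

# One region index per shifted-window region; shape (1, H, W, 1) as above.
img_mask = torch.zeros((1, H, W, 1))
h_slices = (slice(0, -window_size), slice(-window_size, -shift_size), slice(-shift_size, None))
w_slices = (slice(0, -window_size), slice(-window_size, -shift_size), slice(-shift_size, None))
cnt = 0
for h in h_slices:
    for w in w_slices:
        img_mask[:, h, w, :] = cnt
        cnt += 1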
I want to convert it to have the shape:
torch.Size([4, 4, 4, 1])
which should result from partitioning the 8×8 map into four non-overlapping 4×4 windows. That explains the leading 4 in the desired shape: it is the number of windows (2 × 2 of them), while the next two 4s are the window height and width.
My initial attempt was:
windows = x.view(-1, window_size, window_size, C)
However, this approach does not keep the windows spatially contiguous, as the check below shows.
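Here is a minimal check on the mask above (with x = img_mask from the construction earlier and C = 1):

C = 1
bad = img_mask.view(-1, window_size, window_size, C)
print(bad[0, ..., 0])
# view simply refills the 64 values in row-major order, 16 at a time, so the
# "first window" is the first two full-width rows of the map, not a 4x4 block:
# tensor([[0., 0., 0., 0.],
#         [1., 1., 2., 2.],
#         [0., 0., 0., 0.],
#         [1., 1., 2., 2.]])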
The correct way to do this (with B, H, W, C = x.shape and window_size = 4 in my example) is:
x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)
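Wrapped in a helper (the name mirrors window_partition in the reference Swin code) so I can sanity-check the result on my mask:

def window_partition(x, window_size):
    # x: (B, H, W, C) -> (num_windows * B, window_size, window_size, C)
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)

mask_windows = window_partition(img_mask, window_size=4)
print(mask_windows.shape)        # torch.Size([4, 4, 4, 1])
print(mask_windows[0, ..., 0])   # the top-left 4x4 block of the mask, all zeros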
My question is about the right mindset and step-by-step approach for tackling such problems. What mental framework or checklist can I follow to make sure the output meets the required shape and layout, other than trial and error with small tensors?
I understand that this comes down to how the view and permute operations work, but I am struggling to build a simple mental model of them.
Could you guide me on how to think about such transformations?