Initialization trick in Masked Autoencoder (MAE)

Hi, I’m currently working with MAE and got curious about the initialization trick used when initializing self.patch_embed.

MAE uses timm.models.vision_transformer’s PatchEmbed, and PatchEmbed uses an nn.Conv2d to patchify the input.

The PatchEmbed conv is then initialized like this:

# initialize patch_embed like nn.Linear (instead of nn.Conv2d)
w = self.patch_embed.proj.weight.data
torch.nn.init.xavier_uniform_(w.view([w.shape[0], -1]))

As the comment says, the conv weights are intentionally flattened before initialization.
So why do they reshape the conv weights into the shape of an nn.Linear weight?
Is there any advantage to doing so?

Thanks in advance.

Flattening the conv weight makes a difference in the initialization, and I guess the authors saw an advantage in their training from it. Did you check the paper to see if they give an explanation there?

Here is a small example showing the difference:

import math
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, 3)

# standard conv weight shape
fan_in, fan_out = torch.nn.init._calculate_fan_in_and_fan_out(conv.weight)
print(fan_in, fan_out)
# 27 144

gain = 1.0
std = gain * math.sqrt(2.0 / float(fan_in + fan_out))
# 0.10814761408717502

a = math.sqrt(3.0) * std 
# 0.18731716231633877

# flatten
fan_in, fan_out = torch.nn.init._calculate_fan_in_and_fan_out(conv.weight.view(conv.weight.size(0), -1))
print(fan_in, fan_out)
# 27 16

gain = 1.0
std = gain * math.sqrt(2.0 / float(fan_in + fan_out))
# 0.21566554640687682

a = math.sqrt(3.0) * std 
# 0.37354368381881414

Note that a will be used in tensor.uniform_(-a, a) to initialize the tensor.
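To make the effect concrete, here is a small sanity check (my own sketch, not from the MAE repo): initializing the conv weight through its flattened view, as MAE does, draws values from the wider "linear" bound rather than the narrower conv bound computed above.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

conv = nn.Conv2d(3, 16, 3)

# initialize through the flattened (16, 27) view, as in the MAE snippet
w = conv.weight.data
torch.nn.init.xavier_uniform_(w.view(w.shape[0], -1))

# bounds from the fan calculations above
a_conv = math.sqrt(3.0) * math.sqrt(2.0 / (27 + 144))   # ~0.1873
a_linear = math.sqrt(3.0) * math.sqrt(2.0 / (27 + 16))  # ~0.3735

# the largest weight magnitude exceeds the conv bound but stays
# within the linear bound, confirming the "linear-style" init
print(w.abs().max().item())
```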

@ptrblck Hi, thanks for the reply.
I checked the paper and didn’t find any explanation of it.

So my eventual hypothesis is that the purpose is to mimic the patchify behavior of the original ViT.
Even though the original ViT also uses a conv layer for patchify, what it conceptually does is reshape the image into B x N x (P^2 * C) and apply a linear transformation along dim=-1.

So I think MAE flattens the conv weight to initialize the conv layer as if it were that linear layer.
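To support this, here is a small sketch (toy sizes, bias disabled for simplicity; my own example) showing that a stride-P conv is mathematically the same as reshaping the image into B x N x (P^2 * C) patches and applying the flattened conv weight as a linear transform:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

P, C, D = 4, 3, 8                 # patch size, channels, embed dim (toy values)
img = torch.randn(2, C, 16, 16)   # B x C x H x W

# conv-based patchify, as in timm's PatchEmbed
conv = nn.Conv2d(C, D, kernel_size=P, stride=P, bias=False)
conv_out = conv(img).flatten(2).transpose(1, 2)      # B x N x D

# equivalent "linear" view: extract B x N x (C*P*P) patches,
# then multiply by the conv weight flattened to D x (C*P*P)
patches = img.unfold(2, P, P).unfold(3, P, P)        # B x C x 4 x 4 x P x P
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(2, 16, C * P * P)
linear_out = patches @ conv.weight.view(D, -1).t()   # B x N x D

print(torch.allclose(conv_out, linear_out, atol=1e-5))  # True
```

Since the conv is effectively a linear layer over flattened patches, initializing its weight in the flattened shape gives it exactly the xavier statistics that the equivalent nn.Linear would get.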

Do you think this makes sense?

Yes, I think your explanation sounds reasonable, but let’s also wait for an answer from the authors in the corresponding issue.
