I understand that out_features in a Linear layer is often lower than in_features so the network compresses the input into more meaningful features, but sometimes I see out_features higher than in_features, and sometimes the two are equal.
For example, I noticed that in VGG19 the first two of the last three FC layers both have 4096 output features.
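For reference, this is roughly what that classifier head looks like when printed from torchvision (assuming the standard vgg19; the trailing comments are mine):

import torchvision

vgg = torchvision.models.vgg19(weights=None)
print(vgg.classifier)
# Sequential(
#   (0): Linear(in_features=25088, out_features=4096, bias=True)  # lower
#   (1): ReLU(inplace=True)
#   (2): Dropout(p=0.5, inplace=False)
#   (3): Linear(in_features=4096, out_features=4096, bias=True)   # equal
#   (4): ReLU(inplace=True)
#   (5): Dropout(p=0.5, inplace=False)
#   (6): Linear(in_features=4096, out_features=1000, bias=True)   # lower
# )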
In the Swin Transformer we have:
Sequential(
  (0): SwinTransformerBlockV2(
    (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (attn): ShiftedWindowAttentionV2(
      (qkv): Linear(in_features=768, out_features=2304, bias=True)  # Higher
      (proj): Linear(in_features=768, out_features=768, bias=True)  # Equal
      (cpb_mlp): Sequential(
        (0): Linear(in_features=2, out_features=512, bias=True)
        (1): ReLU(inplace=True)
        (2): Linear(in_features=512, out_features=24, bias=False)
      )
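(This looks like torchvision's Swin V2 implementation; something like the sketch below should reproduce a similar printout, assuming the swin_v2_t variant and that the last block stage sits at index 7 of model.features.)

import torchvision

# Assumption: the block above is from the last stage of swin_v2_t; in torchvision
# the block stages and patch-merging layers are interleaved in model.features,
# so the last stage of blocks ends up at index 7.
model = torchvision.models.swin_v2_t(weights=None)
print(model.features[7])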
Does having this help the network in any way?
I want to ask:
- What is the purpose of having higher, equal, or lower out_features than in_features in a network?
- Can you point me to some papers that discuss this, and to network architectures that use these patterns?
I have also done some experiments on the last layers of my custom network: a layer with higher out_features than in_features, followed by one with equal in and out features, then a final layer that projects down to the output size. It sometimes gives better results.
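Roughly, the head I tried looks like this sketch (the widths and the number of output classes here are placeholders, not my real values):

import torch.nn as nn

# Expand -> keep the width -> contract down to the number of classes.
head = nn.Sequential(
    nn.Linear(256, 512),   # higher: out_features > in_features
    nn.ReLU(inplace=True),
    nn.Linear(512, 512),   # equal: out_features == in_features
    nn.ReLU(inplace=True),
    nn.Linear(512, 10),    # lower: project down to the output size
)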