I have a model that extracts features from two different networks (SwinTransformer3D(...) and MyNetwork(...)) in parallel and then concatenates the two resulting feature vectors. The code is similar to this post:
```python
class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        # First feature extractor: 3D Swin Transformer
        self.features1 = SwinTransformer3D(
            pretrained=None,
            pretrained2d=False,
            patch_size=self.patch_size,
            in_chans=1,
            embed_dim=dim,
            depths=[2, 2, 6, 2],
            num_heads=[3, 6, 12, 24],
            window_size=window_size,  # (20, 7, 7)
            mlp_ratio=4.,
            qkv_bias=True,
            qk_scale=None,
            drop_rate=0.,
            attn_drop_rate=0.,
            drop_path_rate=0.2,
            norm_layer=torch.nn.LayerNorm,
            patch_norm=True,
            frozen_stages=-1,
            use_checkpoint=False)
        # Second feature extractor
        self.features2 = MyNetwork(...)
        self.dropout = nn.Dropout(emb_dropout)
        self.fc1 = nn.Linear(self.num_features, self.num_features)
        self.fc2 = nn.Linear(self.num_features, self.num_features)
        self.fc_out = nn.Linear(2 * self.num_features, self.num_features)

    def forward(self, x):
        x = self.to_patch_embedding(x)  # ln1
        x = x + pos_embed               # <-- the lines my questions are about
        x = self.dropout(x)             # <--
        x1 = self.features1(x)          # ln2
        x1 = x1.view(x1.size(0), -1)
        x1 = F.relu(self.fc1(x1))
        x2 = self.features2(x)
        x2 = x2.view(x2.size(0), -1)
        x2 = F.relu(self.fc2(x2))
        # Concatenate in dim 1 (feature dimension)
        x = torch.cat((x1, x2), dim=1)
        x = self.fc_out(x)
        return x
```
and the second network is:
```python
class MyNetwork(nn.Module):
    def __init__(self):
        super(MyNetwork, self).__init__()
        # Learnable position embedding and its dropout
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        self.pos_drop = nn.Dropout(p=drop_rate)
        ...

    def forward_features(self, x):
        ...
        # Add position information, then apply dropout
        x = x + self.pos_embed
        x = self.pos_drop(x)
        ...
        return x
```
I have a few questions:

1. Since the position embedding (self.pos_embed) is defined inside MyNetwork and it is added in its forward_features() method, won't it mismatch the token positions used by SwinTransformer3D(), and should I remove it from this network?
2. Since I feed the same input patches (x) to both networks, should I instead add the position embedding in forward() (between ln1 and ln2, i.e. the highlighted lines) and remove the self.pos_drop call from MyNetwork(), along with its definition self.pos_drop = nn.Dropout(p=drop_rate)? How might this affect training? (See the sketch at the end of this post.)
3. Where should I add pos_drop when the model is a combination of two different models running in parallel, each extracting different features?

I would really appreciate your expert opinion on this.
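To make question 2 concrete, here is a minimal sketch of the variant I have in mind: the position embedding and its dropout are applied once to the shared patch sequence, before both branches, and MyNetwork no longer adds its own pos_embed/pos_drop. The nn.Identity() backbones, the (B, num_patches, embed_dim) input shape, and the layer sizes are placeholders for illustration only, not my actual SwinTransformer3D/MyNetwork configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CombinedModel(nn.Module):
    """Sketch: shared position embedding applied once, before both branches."""
    def __init__(self, num_patches, embed_dim, num_features, emb_dropout=0.1):
        super().__init__()
        # Shared position embedding + dropout (moved out of MyNetwork)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        self.pos_drop = nn.Dropout(p=emb_dropout)
        # Stand-ins for the two feature extractors; in my code these would be
        # SwinTransformer3D(...) and MyNetwork(...) with pos_embed removed.
        self.features1 = nn.Identity()
        self.features2 = nn.Identity()
        self.fc1 = nn.Linear(num_patches * embed_dim, num_features)
        self.fc2 = nn.Linear(num_patches * embed_dim, num_features)
        self.fc_out = nn.Linear(2 * num_features, num_features)

    def forward(self, x):                  # x: (B, num_patches, embed_dim)
        x = x + self.pos_embed             # ln1: shared position information
        x = self.pos_drop(x)
        x1 = self.features1(x).flatten(1)  # ln2: branch 1
        x1 = F.relu(self.fc1(x1))
        x2 = self.features2(x).flatten(1)  # branch 2
        x2 = F.relu(self.fc2(x2))
        x = torch.cat((x1, x2), dim=1)     # concatenate along feature dim
        return self.fc_out(x)

# Example: CombinedModel(num_patches=64, embed_dim=96, num_features=128)
# maps a (2, 64, 96) input to a (2, 128) output.
```

With the real backbones, the flattened feature sizes feeding fc1 and fc2 would of course differ from these placeholder shapes; the point of the sketch is only where pos_embed and pos_drop sit relative to the two branches.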