Where should I add pos_embed and pos_drop for two parallel networks separately?

I have a model that extracts features from two different networks (SwinTransformer3D() and MyNetwork(...)) in parallel and then concatenates the two resulting feature vectors. The code is similar to this:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MyModel(nn.Module):
    def __init__(self, patch_size, window_size, dim, num_features, emb_dropout):
        super(MyModel, self).__init__()
        self.num_features = num_features
        self.to_patch_embedding = ...   # shared patch embedding (defined elsewhere in my code)
        self.features1 = SwinTransformer3D(pretrained=None,
                                           pretrained2d=False,
                                           patch_size=patch_size,
                                           in_chans=1,
                                           embed_dim=dim,
                                           depths=[2, 2, 6, 2],
                                           num_heads=[3, 6, 12, 24],
                                           window_size=window_size,  # e.g. (20, 7, 7)
                                           mlp_ratio=4.,
                                           qkv_bias=True,
                                           qk_scale=None,
                                           drop_rate=0.,
                                           attn_drop_rate=0.,
                                           drop_path_rate=0.2,
                                           norm_layer=torch.nn.LayerNorm,
                                           patch_norm=True,
                                           frozen_stages=-1,
                                           use_checkpoint=False)

        self.features2 = MyNetwork(...)
        self.dropout = nn.Dropout(emb_dropout)

        self.fc1 = nn.Linear(self.num_features, self.num_features)
        self.fc2 = nn.Linear(self.num_features, self.num_features)

        self.fc_out = nn.Linear(2 * self.num_features, self.num_features)

    def forward(self, x):
        x = self.to_patch_embedding(x)  # ln1

        x = x + pos_embed       # <-- the two lines I'm asking about:
        x = self.dropout(x)     # <-- where should they live?

        x1 = self.features1(x)  # ln2
        x1 = x1.view(x1.size(0), -1)
        x1 = F.relu(self.fc1(x1))

        x2 = self.features2(x)
        x2 = x2.view(x2.size(0), -1)
        x2 = F.relu(self.fc2(x2))

        # Concatenate along dim 1 (the feature dimension); x1 and x2 are 2D after view()
        x = torch.cat((x1, x2), dim=1)
        x = self.fc_out(x)
        return x

I have a few questions:

  1. Since SwinTransformer3D already learns parameters for the (relative) position bias inside its attention blocks, what is its pos_drop (the self.pos_drop = nn.Dropout(p=drop_rate) line in its code) actually for?

  2. The second network, MyNetwork(...), has:

        class MyNetwork(nn.Module):
            def __init__(self):
                super(MyNetwork, self).__init__()
                self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
                self.pos_drop = nn.Dropout(p=drop_rate)
                ...

            def forward_features(self, x):
                ...
                x = x + self.pos_embed
                x = self.pos_drop(x)
                ...
                return x
    

so its positional embedding is added to the tokens inside forward_features(). Doesn't this clash with how SwinTransformer3D() handles token positions, and should I remove it from this class? (See the sketch right after this question for how I currently picture the two schemes.)
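For context, here is a minimal, simplified sketch of the two positional schemes as I understand them. The class names and shapes below are mine and purely illustrative, not the real implementations; please correct me if this picture is wrong:

import torch
import torch.nn as nn

class AbsolutePositionBlock(nn.Module):
    # MyNetwork / ViT style: a learned absolute embedding is added to the patch
    # tokens once, then dropout is applied to the result.
    def __init__(self, num_patches, embed_dim, drop_rate=0.):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        self.pos_drop = nn.Dropout(p=drop_rate)

    def forward(self, x):  # x: (B, num_patches, embed_dim)
        return self.pos_drop(x + self.pos_embed)


class RelativeBiasScores(nn.Module):
    # Swin style, heavily simplified: nothing is added to the tokens themselves;
    # a learned bias indexed by relative position is added to the attention
    # logits inside every window-attention block instead.
    def __init__(self, num_heads, window_tokens):
        super().__init__()
        self.relative_position_bias = nn.Parameter(
            torch.zeros(num_heads, window_tokens, window_tokens))

    def forward(self, attn_logits):  # attn_logits: (B, heads, tokens, tokens)
        return attn_logits + self.relative_position_bias

If this picture is right, the two branches encode position in two different ways, which is what prompted questions 1 and 2.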

  3. Since both networks receive the same input patches (x), should I add the position embedding in MyModel's forward() (between ln1 and ln2), remove self.pos_embed and self.pos_drop from MyNetwork(), and also remove

    self.pos_drop = nn.Dropout(p=drop_rate)

from SwinTransformer3D()? (A sketch of what I mean is below.)

How might this affect training?
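To make question 3 concrete, here is a rough, untested sketch of the rearrangement I mean. SharedPositionFusion and its arguments are placeholder names of mine, and it assumes both sub-networks could accept already position-encoded patch tokens once their own pos_embed / pos_drop are removed:

import torch
import torch.nn as nn

class SharedPositionFusion(nn.Module):
    # Rough sketch of question 3: add pos_embed / pos_drop once on the shared
    # patch tokens, then feed the same position-encoded tokens to both branches
    # and concatenate their flattened features.
    def __init__(self, branch1, branch2, num_patches, embed_dim, emb_dropout):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        self.pos_drop = nn.Dropout(p=emb_dropout)
        self.branch1 = branch1  # would be SwinTransformer3D(...) with drop_rate=0.?
        self.branch2 = branch2  # would be MyNetwork(...) without its own pos_embed/pos_drop?

    def forward(self, patches):                      # patches: (B, num_patches, embed_dim)
        x = self.pos_drop(patches + self.pos_embed)  # position added once, shared
        f1 = self.branch1(x).flatten(1)
        f2 = self.branch2(x).flatten(1)
        return torch.cat((f1, f2), dim=1)            # concatenate along the feature dim

# Shape check with two toy branches (placeholders, not the real networks):
# model = SharedPositionFusion(nn.Identity(), nn.Linear(96, 96),
#                              num_patches=196, embed_dim=96, emb_dropout=0.1)
# model(torch.randn(2, 196, 96)).shape  # torch.Size([2, 2 * 196 * 96])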

I would really appreciate your expert opinion on this: where should pos_embed and pos_drop go when a model combines two different networks in parallel, each extracting different features?