Multi-system model: batch norm vs. layer norm

Hello,
I’m new to PyTorch :slight_smile:
I have a regression task and use a model that receives two different sequential inputs, applies a BiLSTM to each input separately, concatenates the resulting representations, and predicts a value with a linear layer of output size 1 (my forward() function is below).

I’m using gradient accumulation as explained here: [How to implement accumulated gradient?] (the second option), so my model receives a single sample in each forward() call.
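
For context, this is roughly what my training loop looks like (simplified; criterion, train_data, and accumulation_steps are placeholder names rather than my exact code):

optimizer.zero_grad()
for i, (input1, input2, target) in enumerate(train_data):
    prediction = model(input1, input2)              # one sample per forward() call
    loss = criterion(prediction, target) / accumulation_steps
    loss.backward()                                 # gradients accumulate in param.grad
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                            # update once every accumulation_steps samples
        optimizer.zero_grad()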

I want to add normalization to my model.

  1. Is there a problem with adding batch normalization given that I’m using gradient accumulation?
  2. Should I add batch normalization or layer normalization?
  3. Where in my model should I add the normalization: before or after the LSTM?
  4. To which part of the model should I add it? To input1 and input2 separately (roughly as in the sketch below)? After the concatenation? In both places?
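
To make questions 3 and 4 concrete, here is a tiny stand-alone sketch of one placement I had in mind (a LayerNorm on each input branch before its BiLSTM); all sizes and names here are made up for illustration, not my real model:

import torch.nn as nn

# Toy sketch only: normalize each branch separately before its BiLSTM.
# The dimensions (50, 20, 64) are placeholders, not my real model.
class ToyTwoBranchModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.norm1 = nn.LayerNorm(50)   # input1 branch (embedding dim 50)
        self.encoder1 = nn.LSTM(50, 64, batch_first=True, bidirectional=True)
        self.norm2 = nn.LayerNorm(20)   # input2 branch (feature dim 20)
        self.encoder2 = nn.LSTM(20, 64, batch_first=True, bidirectional=True)

    def forward(self, input1, input2):
        out1, _ = self.encoder1(self.norm1(input1))   # normalize, then BiLSTM
        out2, _ = self.encoder2(self.norm2(input2))
        return out1, out2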

My forward function in the model:

def forward(self, input1, input2):
    # input1 part
    embeds = self.word_embedding(input1)    # GloVe word embeddings
    encoder1_out = self.encoder1(embeds)    # BiLSTM
    attention_out = self.HAN(encoder1_out)  # hierarchical attention network

    # input2 part
    encoder2_out = self.encoder2(input2)    # BiLSTM

    # combined part
    info_vector = torch.cat((attention_out, torch.flatten(encoder2_out).unsqueeze(0)), dim=1)
    return self.linear(info_vector)  # [1, hidden_dim_1 + flattened_hidden_dim_2] -> [1]

Thank you!
Almog