Multi-system model: batch norm vs. layer norm

Hello,
I’m new to PyTorch :slight_smile:
I have a regression task and use a model that receives two different sequential inputs, applies a BiLSTM to each input separately, concatenates the resulting representations, and predicts a value with a linear layer of output size 1 (my forward() function is below).

I’m using gradient accumulation as explained here: [How to implement accumulated gradient?] (the second option), so my model receives a single sample in each forward() call.
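
For context, this is roughly what my training loop looks like (simplified; criterion, train_data, and accumulation_steps are placeholder names rather than my exact code):

optimizer.zero_grad()
for i, (input1, input2, target) in enumerate(train_data):
    prediction = model(input1, input2)              # one sample per forward() call
    loss = criterion(prediction, target) / accumulation_steps
    loss.backward()                                 # gradients accumulate in param.grad
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                            # update once every accumulation_steps samples
        optimizer.zero_grad()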

I want to add normalization to my model.

  1. Is there a problem with adding batch normalization given that I’m using gradient accumulation?
  2. Should I add batch normalization or layer normalization?
  3. Where in my model should I add the normalization: before or after the LSTM?
  4. To which part of the model should I add it? To input1 and input2 separately (roughly as in the sketch below)? After the concatenation? In both places?
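
To make questions 3 and 4 concrete, here is a tiny stand-alone sketch of one placement I had in mind (a LayerNorm on each input branch before its BiLSTM); all sizes and names here are made up for illustration, not my real model:

import torch.nn as nn

# Toy sketch only: normalize each branch separately before its BiLSTM.
# The dimensions (50, 20, 64) are placeholders, not my real model.
class ToyTwoBranchModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.norm1 = nn.LayerNorm(50)   # input1 branch (embedding dim 50)
        self.encoder1 = nn.LSTM(50, 64, batch_first=True, bidirectional=True)
        self.norm2 = nn.LayerNorm(20)   # input2 branch (feature dim 20)
        self.encoder2 = nn.LSTM(20, 64, batch_first=True, bidirectional=True)

    def forward(self, input1, input2):
        out1, _ = self.encoder1(self.norm1(input1))   # normalize, then BiLSTM
        out2, _ = self.encoder2(self.norm2(input2))
        return out1, out2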

My forward function in the model:

def forward(self, input1, input2):
    # input1 part
    embeds = self.word_embedding(input1)    # GloVe word embeddings
    encoder1_out = self.encoder1(embeds)    # BiLSTM
    attention_out = self.HAN(encoder1_out)  # hierarchical attention network

    # input2 part
    encoder2_out = self.encoder2(input2)    # BiLSTM

    # combined part
    info_vector = torch.cat((attention_out, torch.flatten(encoder2_out).unsqueeze(0)), dim=1)
    return self.linear(info_vector)  # [1, hidden_dim_1 + flattened_hidden_dim_2] -> [1]

Thank you!
Almog