Two different sequence padding methods / dynamic nn.Linear

My model feeds a batch of sequences into nn.Transformer, and the transformer output is then fed into nn.Linear. The input sequences have different lengths, so I pad them to a common length within each batch, adaptively, using a collate_fn in the DataLoader.

For example:

  • batch 1: the longest sequence has length 10, so every sequence is zero-padded to length 10
  • batch 2: the longest sequence has length 12, so every sequence is zero-padded to length 12
  • batch 3: the longest sequence has length 15, so every sequence is zero-padded to length 15

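The per-batch padding above can be sketched with a collate_fn built on torch's pad_sequence. This is a minimal example, not the original code; the (sequence, label) sample format and d_input = 4 are assumptions for illustration:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_fn(batch):
    # batch: list of (seq, label) pairs, where each seq is [seq_len, d_input]
    # and seq_len varies per sample (hypothetical sample format).
    seqs, labels = zip(*batch)
    # Zero-pad every sequence to the longest one in *this* batch;
    # result shape: [max_len_in_batch, batch_size, d_input].
    padded = pad_sequence(seqs, batch_first=False, padding_value=0.0)
    return padded, torch.stack(labels)

# Example: two samples of lengths 3 and 5 pad to max length 5.
batch = [(torch.randn(3, 4), torch.tensor(0)),
         (torch.randn(5, 4), torch.tensor(1))]
padded, labels = collate_fn(batch)  # padded.shape == (5, 2, 4)
```

Because pad_sequence pads only to the longest sequence in the current batch, each batch gets its own max length, exactly as in the three batches listed above.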
Now I have a problem: the output shape of the Transformer is [seq_len, batch_size, d_input], so the input dimension of nn.Linear would be dynamic, because the sequence length differs from batch to batch.
I have seen some posts mention that nn.AdaptiveAvgPool can solve this problem. So I have two candidate solutions and a question:

  • Transpose the transformer output to [batch_size, d_input, seq_len], use nn.AdaptiveAvgPool1d to reduce it to [batch_size, d_input, fix_num], then reshape to [batch_size, -1] and feed it into nn.Linear. Does this make sense? The sequence lengths differ, but we force them down to one fixed number.

  • Set a fixed maximum length, e.g. 30. Some sequences are 25-30 long, but most are only 5-10, so the padded input would be mostly 0 (padding values). It would carry too much useless information!

Has anyone compared these two padding methods, or is there another good suggestion?