Is there a way to pass a padded input into an nn.Linear layer

Hi.

I’m implementing a transformer for time series classification. For the embedding input into the transformer, I pass the sequence through a linear layer, as done in Deep Transformer Models for Time Series Forecasting: The Influenza Prevalence Case.

However, for variable sequence lengths, I have to pad the input sequence to a fixed size before passing it into the input layer.

Is there any way to make the linear layer ignore the paddings I’ve added i.e. a mask of some sort?

Yes, the linear layer expects a fixed-size input, so you need to pad the shorter sequences.

The “ignoring” needs to be done as part of the self-attention layer. You obviously don’t want any input token to attend to a padding token. This is done via masking, here using a padding mask. All off-the-shelf transformer models support this. I have a Jupyter notebook that tries to explain it a bit. Maybe it’s useful.
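For reference, here is a minimal sketch of what passing such a padding mask looks like with PyTorch’s off-the-shelf nn.TransformerEncoder (all dimensions are just toy values picked for illustration):

import torch
import torch.nn as nn

# Toy encoder: 2 layers, model dimension 512.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

# src: (batch_size, seq_len, d_model); padding_mask: (batch_size, seq_len),
# True at positions that are padding and should be ignored by the attention.
src = torch.rand(2, 10, 512)
padding_mask = torch.zeros(2, 10, dtype=torch.bool)
padding_mask[:, 7:] = True  # pretend the last 3 positions are padding

out = encoder(src, src_key_padding_mask=padding_mask)
print(out.shape)  # torch.Size([2, 10, 512])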


Thanks Chris. Yes the notebook is incredibly useful!

May I please clarify: it seems that in this case the masking applies to the embedding, i.e., the embedding is what gets padded, and the mask then corresponds to the padded embedding.

In my case I would like to pad the input sequence and then generate the embedding via the linear layer. So the mask needs to correspond to the padding of the input sequence, which my model is not happy with because it wants the mask’s shape to correspond to that of the embedding.

Thank you.

Not sure if I understood your question correctly.

What is padded is the input. For NLP tasks, the input is typically a sentence, i.e., a sequence of words, where each word is represented by a unique identifier (some integer) referring to a vocabulary. Padding means adding a special word/token to each sentence, often by reserving id 0 for this padding token (but it can be anything as long as it’s consistent). So in the end you might have a padded sequence looking like [32, 6, 88, 542, 56, 34, 511, 0, 0, 0].
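For illustration, padding a batch of variable-length id sequences to a common length could look like this (a sketch; torch.nn.utils.rnn.pad_sequence is just one convenient helper, and id 0 is the assumed padding id):

import torch
from torch.nn.utils.rnn import pad_sequence

# Two sequences of different lengths; the ids refer to some vocabulary.
seqs = [
    torch.tensor([32, 6, 88, 542, 56, 34, 511]),
    torch.tensor([12, 7, 99]),
]

# Pad to the length of the longest sequence, using id 0 as the padding token.
padded = pad_sequence(seqs, batch_first=True, padding_value=0)
print(padded)
# tensor([[ 32,   6,  88, 542,  56,  34, 511],
#         [ 12,   7,  99,   0,   0,   0,   0]])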

This input is pushed through the embedding layer to map each word (index) to its corresponding embedding vector. This means there is one embedding vector that represents the padding token. Note that an embedding layer is just a linear layer with some conveniences.
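A sketch of that lookup, assuming id 0 is the padding index (padding_idx simply keeps that row’s vector at zero and excludes it from gradient updates):

import torch
import torch.nn as nn

vocab_size, d_model = 1000, 512  # toy sizes
embedding = nn.Embedding(vocab_size, d_model, padding_idx=0)

x = torch.tensor([[32, 6, 88, 542, 56, 34, 511, 0, 0, 0]])  # (batch_size, seq_len)
emb = embedding(x)                                           # (batch_size, seq_len, d_model)
print(emb.shape)         # torch.Size([1, 10, 512])
print(emb[0, -1].sum())  # tensor(0., ...) -- the padding token's vector is all zeros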

The masking is typically done in two locations:

  • Attention layer: In the simplest case, the attention between two tokens is the dot product between the corresponding embedding vectors, which is just a scalar value, independent of the embedding size. The masking ensures that the dot product between a normal token and a padding token is ignored.

  • Output layer: Intuitively, the loss should not be calculated based on padding tokens. So again, you want to ignore those.

In short, the padding mask depends only on the input, i.e., on which positions of the sequences are padded with this special padding-token index. The embedding layer is independent of that.
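Concretely, a sketch of deriving the padding mask straight from the padded input ids (again assuming 0 is the padding id; True marks positions the attention should ignore):

import torch

PAD_ID = 0
x = torch.tensor([[32, 6, 88, 542, 56, 34, 511, 0, 0, 0],
                  [12, 7, 99,   0,  0,  0,   0, 0, 0, 0]])

padding_mask = (x == PAD_ID)  # (batch_size, seq_len), bool
print(padding_mask)
# tensor([[False, False, False, False, False, False, False,  True,  True,  True],
#         [False, False, False,  True,  True,  True,  True,  True,  True,  True]])

# This is the mask passed to the attention (e.g. src_key_padding_mask in
# nn.TransformerEncoder), and the same positions can be ignored in the loss,
# e.g. via nn.CrossEntropyLoss(ignore_index=PAD_ID) for token-level targets.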

I see. So I’ve definitely been doing the wrong thing then. I padded the sequence and then passed the entire sequence into the linear layer, i.e.:

def __init__(self, padded_sequence_length, ...):
    # project the whole padded sequence at once (padded_sequence_length -> 512)
    self.fc = nn.Linear(padded_sequence_length, 512)
    ...

def forward(self, batch):
    # x is a batch of padded sequences, e.g. x = [[32, 6, 88, 542, 56, 34, 511, 0, 0, 0], ...]
    x, y = batch

    input_embedding = self.fc(x)
    pos_encoder_output = self.pos_encoder(input_embedding)
    transformer_output = self.transformer_encoder(pos_encoder_output)
    ...

I got this from here. However, I’m realising that I should probably look into the nn.Embedding layer more.

Should I rather iterate over the sequence, feeding each timestep into a linear layer, and then stack the results into a tensor, i.e.:

def __init__(self, ...):
    # project each individual timestep instead (1 value -> 512)
    self.fc = nn.Linear(1, 512)
    ...

def forward(self, batch):
    # x is a batch of padded sequences, e.g. x = [[32, 6, 88, 542, 56, 34, 511, 0, 0, 0], ...]
    x, y = batch
    all_input_embeddings = []
    for example in x:
        # embed every timestep of this example separately: (1,) -> (512,)
        single_example_input_embeddings = torch.stack(
            [self.fc(timestep.unsqueeze(0)) for timestep in example]
        )
        all_input_embeddings.append(single_example_input_embeddings)

    all_input_embeddings = torch.stack(all_input_embeddings)  # (batch_size, seq_len, 512)
    ...

Something like that?

Looking over the notebook more carefully now.

Both code snippets seem to perform the same step, and the first one is certainly preferable to the loop in the second snippet. Your batch x has shape (batch_size, padded_sequence_length), and after pushing it through self.fc the shape will be (batch_size, padded_sequence_length, 512). This looks alright.

However, at some point you need to compute the padding mask for your input (cf. the create_mask() method in the notebook) and pass it to the encoder.
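Something along these lines (a sketch that assumes your encoder is nn.TransformerEncoder, that the padding id in x is 0, and that the embedding keeps its sequence dimension, i.e. has shape (batch_size, padded_sequence_length, 512)):

# Inside forward(), after computing pos_encoder_output:
padding_mask = (x == 0)  # (batch_size, padded_sequence_length), True at padded positions
transformer_output = self.transformer_encoder(
    pos_encoder_output, src_key_padding_mask=padding_mask
)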

Awesome, thanks.

My apologies for taking up so much of your time :sweat_smile: . I think this is where I’m getting confused about the linear layer. In the first snippet, wouldn’t the shape become (batch_size, 512)? I.e.:

import torch

batch_size, seq_length, linear_out_length = 2, 4, 10
input_ = torch.rand(batch_size, seq_length)           # (batch_size, seq_length)
fc = torch.nn.Linear(seq_length, linear_out_length)   # maps the whole sequence at once
fc(input_).shape
>> torch.Size([2, 10]) # (batch_size, linear_out_length) -- the sequence dimension is gone

Thus, even if I pad my input to seq_length = 6, I will always get an output of shape (2, 10), and therefore I don’t quite know what shape the mask should have.

Looking at nn.Embedding, you’re right that it generates an embedding for each timestep, so the padded positions will have their own embeddings in the tensor, which I can then mask.
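To check my understanding, here is a minimal sketch of how I think the pieces fit together for a univariate series: project each timestep with nn.Linear(1, d_model) so the output keeps its sequence dimension, and build the mask from the padded positions (the padding value of 0.0 and all dimensions below are just my own assumptions):

import torch
import torch.nn as nn

d_model = 512
fc = nn.Linear(1, d_model)  # per-timestep projection: 1 feature -> d_model
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

# Padded batch of univariate series, (batch_size, seq_len); 0.0 marks padding here
# (in practice you would track the true lengths rather than rely on a sentinel value).
x = torch.tensor([[0.3, 1.2, 0.7, 0.0, 0.0],
                  [0.9, 0.1, 0.4, 0.8, 0.2]])
padding_mask = (x == 0.0)  # (batch_size, seq_len)

emb = fc(x.unsqueeze(-1))  # (batch_size, seq_len, 1) -> (batch_size, seq_len, d_model)
out = encoder(emb, src_key_padding_mask=padding_mask)
print(out.shape)           # torch.Size([2, 5, 512])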