Training an LSTM with variable-length sequences

Hi all! I’ve gone through a bunch of similar posts on this topic, and while I understand that I need padding and packing, I still haven’t been able to find out how to properly pass this data into a loss function. Could anyone check whether the logic of my code makes sense? Thanks in advance!

  1. I create a padded set of data as follows:
seq_lengths = torch.LongTensor(list(map(len, observations_history)))
observations_history = pad_sequence(observations_history).to(self.device)
  2. Then I pass the padded data and the sequence lengths into the forward pass of my neural network:
# Embedding layer
x = F.relu(self.fc1(x))

# Padded LSTM layer
x = pack_padded_sequence(x, seq_lengths, enforce_sorted=False)
x, _ = self.lstm1(x)
x, x_unpacked_len = pad_packed_sequence(x)
# Select each sequence's output at its last valid timestep
time_dimension = 0  # the LSTM output is time-major: (T, B, H)
last_timestep_idx = (seq_lengths - 1).view(-1, 1).expand(len(seq_lengths), x.size(2))
last_timestep_idx = last_timestep_idx.unsqueeze(time_dimension).to(x.device)
x = x.gather(time_dimension, last_timestep_idx).squeeze(time_dimension)

# Remaining layers
x = F.relu(self.fc2(x))
q_values = self.fc3(x)
return q_values
  3. Those q_values then get passed into an MSE loss function: qf1_loss = 0.5 * F.mse_loss(qf1_a_values, next_q_value)
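
For reference, here is a minimal, self-contained sketch of the three steps above that can be run on its own. The class name QNetwork, the layer sizes, the toy sequences, and the all-zero target are assumptions made purely for illustration; only the pad, pack, LSTM, gather, and loss sequence mirrors the snippets in this post.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import (
    pad_sequence, pack_padded_sequence, pad_packed_sequence,
)

torch.manual_seed(0)


class QNetwork(nn.Module):
    # Hypothetical sizes; the real network's dimensions are not given in the post.
    def __init__(self, obs_dim=4, hidden_dim=8, n_actions=3):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, hidden_dim)
        self.lstm1 = nn.LSTM(hidden_dim, hidden_dim)  # time-major: (T, B, H)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, n_actions)

    def forward(self, x, seq_lengths):
        # Embedding layer, applied to every position (padding included)
        x = F.relu(self.fc1(x))

        # Packed LSTM layer; the lengths tensor must live on the CPU
        packed = pack_padded_sequence(x, seq_lengths.cpu(), enforce_sorted=False)
        packed_out, _ = self.lstm1(packed)
        out, _ = pad_packed_sequence(packed_out)  # (T_max, B, H)

        # Pick each sequence's output at its last valid timestep
        idx = (seq_lengths - 1).to(out.device).view(1, -1, 1)
        idx = idx.expand(1, out.size(1), out.size(2))
        last = out.gather(0, idx).squeeze(0)  # (B, H)

        # Remaining layers
        x = F.relu(self.fc2(last))
        return self.fc3(x)


# Step 1: three toy sequences of lengths 5, 2 and 3
observations_history = [torch.randn(5, 4), torch.randn(2, 4), torch.randn(3, 4)]
seq_lengths = torch.tensor([len(s) for s in observations_history])
padded = pad_sequence(observations_history)  # (T_max, B, obs_dim)

# Step 2: forward pass
net = QNetwork()
q_values = net(padded, seq_lengths)  # (B, n_actions)

# Step 3: loss against a placeholder target (zeros, for illustration only)
target = torch.zeros_like(q_values)
loss = 0.5 * F.mse_loss(q_values, target)
loss.backward()
```

One quick sanity check under these assumptions: feeding a single sequence through the network on its own, unpadded, should give the same Q-values as the corresponding row of the padded batch.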

A couple of questions here:

  1. My first embedding FC layer operates directly on the padded data. Is this OK? It seems computationally wasteful, but I don’t think FC layers can consume packed data, can they?

  2. With all the padding and packing going on in the intermediate layers, is backprop happening correctly, using only the non-padded data? Or is it also passing gradients through the padded positions and thus making incorrect gradient updates? What is the correct way to set this up?
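
One way to probe the second question empirically is to look at the gradient that reaches the padded input positions. The sketch below is a toy setup, not the actual network: the layer sizes, the helper name last_valid_output, and the stand-in loss are all assumptions. It makes the padded input a leaf tensor with requires_grad=True, runs FC, pack, LSTM, and last-timestep gather, and then inspects padded.grad. It also overwrites the padding with large values to check that the output does not change.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import (
    pad_sequence, pack_padded_sequence, pad_packed_sequence,
)

torch.manual_seed(0)

# Hypothetical layer sizes, chosen only for this check
fc1 = nn.Linear(4, 8)
lstm1 = nn.LSTM(8, 8)  # time-major: (T, B, H)


def last_valid_output(padded, seq_lengths):
    """FC on padded data, packed LSTM, then gather the last valid timestep."""
    x = F.relu(fc1(padded))
    packed = pack_padded_sequence(x, seq_lengths, enforce_sorted=False)
    packed_out, _ = lstm1(packed)
    out, _ = pad_packed_sequence(packed_out)
    idx = (seq_lengths - 1).view(1, -1, 1).expand(1, out.size(1), out.size(2))
    return out.gather(0, idx).squeeze(0)


sequences = [torch.randn(5, 4), torch.randn(2, 4)]
seq_lengths = torch.tensor([5, 2])
padded = pad_sequence(sequences).requires_grad_(True)  # (5, 2, 4)

out = last_valid_output(padded, seq_lengths)
out.pow(2).mean().backward()  # stand-in loss, for illustration only

# Sequence 1 has length 2, so timesteps 2..4 of column 1 are padding
pad_grad = padded.grad[2:, 1]
real_grad = padded.grad[:2, 1]
print("max |grad| on padded positions:", pad_grad.abs().max().item())
print("max |grad| on real positions:  ", real_grad.abs().max().item())

# Replace the padding with large values; the result should be unaffected
with torch.no_grad():
    noisy = padded.detach().clone()
    noisy[2:, 1] = 1e3
    out_noisy = last_valid_output(noisy, seq_lengths)
print("output unchanged by padding values:", torch.allclose(out, out_noisy))
```

If the padded positions show zero gradient and the output ignores the padding values, the FC layer on padded data wastes some compute but does not affect the parameter updates. Note that this relies on gathering at seq_lengths - 1; indexing the padded LSTM output with x[-1] or averaging over the time dimension would mix the zero-filled padded outputs into the result.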