I’m trying to pass time series data to an LSTM network. The data is preprocessed with both min-max scaling and one-hot encoding. I split the data into sequences in the Dataset, then pad and pack those sequences with a collate_fn in the DataLoader. The network is then run through a training loop.
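For reference, the padding/packing step looks roughly like this (a minimal sketch; the actual collate_fn assumes each dataset item is a (sequence, target) pair where the sequence has shape (seq_len, num_features)):

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

def collate_fn(batch):
    # batch: list of (sequence_tensor, target) pairs
    sequences, targets = zip(*batch)
    lengths = torch.tensor([len(s) for s in sequences])
    # Pad all sequences in the batch to the same length: (B, max_len, F)
    padded = pad_sequence(sequences, batch_first=True)
    # Pack so the LSTM skips the padded steps
    packed = pack_padded_sequence(padded, lengths,
                                  batch_first=True, enforce_sorted=False)
    return packed, torch.stack(targets)
```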
My issue arises when I try to unpack the values from the LSTM layer of the last batch and pass them into the network’s final linear layer in the forward function:
def forward(self, x):
    lstm = self.lstm
    batch_size = self.batch_size
    h0 = torch.zeros(self.num_layers, batch_size, self.hidden_size)
    c0 = torch.zeros(self.num_layers, batch_size, self.hidden_size)
    print(f"Expected input size: {self.input_size}")
    print(f"Actual input size: {x.data.size()}")
    # ****** Fog feature contains only 0 for the last batch (first values, date[0]=-1) ******
    # The Fog column is dropped during one-hot encoding (or another preprocessing step),
    # which causes the input size inconsistency
    packed_lstm_out, (hn, cn) = lstm(x, (h0, c0))
    print(f"lstm_out size: {packed_lstm_out.data.size()}")
    unpacked_lstm_out, _ = pad_packed_sequence(sequence=packed_lstm_out, batch_first=True)
    print(f"Unpacked lengths: {[len(seq) for seq in unpacked_lstm_out]}")
    # unpacked_lstm_tensor = torch.stack(unpacked_lstm_out, dim=0).float().requires_grad_(True)
    print(f"Unpacked shape: {unpacked_lstm_out.shape}\n")
    output = self.fc1(unpacked_lstm_out[:, -1, :])
    return output
I am getting an error that the input size (which should be 9) is too small (8). When I inspect the data in the batch that triggers the error, I can see that one of the one-hot encoded features is missing.
How can I stop this column/feature from being dropped, so that the input size is consistent across all batches?
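I suspect the encoder needs to be told the full category set up front rather than inferring it from each slice of the data. A minimal pandas sketch of what I mean (assuming pd.get_dummies is used for the encoding; it only emits columns for values actually present unless the dtype is categorical):

```python
import pandas as pd

# A slice where Fog is always 0, like my last batch
df = pd.DataFrame({"fog": [0, 0, 0]})

# Declare both possible values so get_dummies emits a column per
# category even when one of them never occurs in this slice
df["fog"] = pd.Categorical(df["fog"], categories=[0, 1])
encoded = pd.get_dummies(df, columns=["fog"])
print(list(encoded.columns))  # both fog_0 and fog_1 are present
```

Is fixing the category set like this (or fitting the encoder once on the full dataset before splitting into sequences) the right approach?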