Multi-tensor input, no change in epoch loss (LSTM + Linear network)

I am trying to implement a binary classifier where each sample includes time-series data captured at different sampling frequencies over a fixed period. I have not generated a full dataset yet, but I wanted to get a minimal framework working first to confirm the approach can be implemented.

The problem is that my minimal working example shows no movement in the loss during training, not even noise, and I wonder where I have gone wrong. I suspect the multiple data tensors might be complicating things - or I have missed something essential.

For each training sample, I have N tensors (initially 2D) that correspond to the different sampling frequencies. The rows of each tensor are time indices and the columns are features. There is a single label (0 or 1) for each training sample. I used a custom PyTorch Dataset to return these tensors per sample (the DataLoader then stacks them per batch) as follows (excerpt):

def __getitem__(self, index):
    # self.tensors is a tuple of 2D data tensors (different shapes)
    # corresponding to different sampling frequencies
    features = [tensor[index] for tensor in self.tensors]

    if self.labels is None:
        return features
    else:
        return (features, self.labels[index])
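For reference, here is a minimal sketch of the full dataset class (the class name MultiFreqDataset and the exact stacked shapes are simplified placeholders for my real code, since I have not generated the full dataset yet):

import torch
from torch.utils.data import Dataset

class MultiFreqDataset(Dataset):
    def __init__(self, tensors, labels=None):
        # tensors: tuple of data tensors, one per sampling frequency,
        # each shaped (sample_count, time_steps, feature_count)
        self.tensors = tensors
        self.labels = labels

    def __len__(self):
        return len(self.tensors[0])

    def get_feature_count(self):
        # feature columns are the same across frequencies
        return self.tensors[0].shape[-1]

    def __getitem__(self, index):
        features = [tensor[index] for tensor in self.tensors]
        if self.labels is None:
            return features
        return (features, self.labels[index])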

The dataset (e.g., train_ds) can be indexed as follows:

train_ds[sample_index][0][frequency_index][...row/col...]  # For data
train_ds[sample_index][1] # For label

When used with a DataLoader, the features returned by __getitem__ are collated into one stacked tensor per frequency for the batch (e.g., x), indexed as follows:

x[frequency_index][sample_index][...row/col...]  # For data 

Note the effective swap of the first two indices when the DataLoader is used. For each training sample, the data from each of that sample's tensors is pushed through its own LSTM network (one per frequency, each with a different input length corresponding to the number of time steps at that sampling frequency). The LSTM outputs are then fed into a linear layer, a ReLU activation, a second linear layer, and finally a sigmoid output.
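To double-check this batching behaviour, I ran a quick shape test with dummy data (continuing from the dataset sketch above; the feature count of 4 is arbitrary):

daily = torch.randn(10, 189, 4)      # 10 samples, 189 time steps, 4 features
quarterly = torch.randn(10, 3, 4)    # 10 samples, 3 time steps, 4 features
labels = torch.randint(0, 2, (10,))

ds = MultiFreqDataset((daily, quarterly), labels)
dl = torch.utils.data.DataLoader(ds, batch_size=5)

x, y = next(iter(dl))
print(len(x))      # 2 -> frequency index comes first
print(x[0].shape)  # torch.Size([5, 189, 4]) -> sample index second
print(x[1].shape)  # torch.Size([5, 3, 4])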

The model definition is:

def __init__(self, feature_count, freq_window_sizes, hidden_dim=32):
    # hidden_dim default is arbitrary
    super().__init__()

    # LSTMs for each of the different sampling frequencies over a fixed
    # period (i.e., each tensor has a different number of time steps)
    self.layers = nn.ModuleList([nn.LSTM(window_size, 1)
                                 for window_size in freq_window_sizes])

    # Linear layers that combine the LSTM outputs; with hidden_size=1,
    # each LSTM contributes one value per feature
    lstm_output_count = feature_count * len(freq_window_sizes)
    self.fc1 = nn.Linear(lstm_output_count, hidden_dim)
    self.fc2 = nn.Linear(hidden_dim, 1)  # single output for the binary label
    self.fc3 = nn.Sigmoid()

The forward code is as follows:

def forward(self, x):
    y_pred = list()

    sample_count = len(x[0])
    freq_count = len(x)

    for i in range(sample_count):
        lstm_outputs = list()

        # Different LSTMs for the different frequencies of data
        for j in range(freq_count):
            tensor_data = x[j][i]          # (time steps, features)
            tensor_data = tensor_data.t()  # (features, time steps)
            tensor_data.unsqueeze_(0)      # (1, features, time steps)

            lstm_out, _ = self.layers[j](tensor_data)
            lstm_outputs += lstm_out       # adds the (features, 1) output slice

        lstm_outputs = torch.cat(lstm_outputs, 0)

        # Push the LSTM outputs through the linear layers
        y = F.relu(self.fc1(lstm_outputs.t()))
        y = self.fc2(y)
        y = self.fc3(y)

        y_pred.append(y)

    return torch.tensor(y_pred, requires_grad=True)
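To make sure I follow the shapes inside forward, here is a trace of a single sample through one of the LSTMs with dummy data (again assuming 4 features):

import torch

tensor_data = torch.randn(189, 4)       # one sample: (time steps, features)
tensor_data = tensor_data.t()           # (4, 189)
tensor_data = tensor_data.unsqueeze(0)  # (1, 4, 189)

lstm = torch.nn.LSTM(189, 1)            # input_size=189, hidden_size=1
lstm_out, _ = lstm(tensor_data)
print(lstm_out.shape)                   # torch.Size([1, 4, 1])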

The training code is as follows:

def train(model, train_loader, epochs, criterion, optimizer, device):
    for epoch in range(1, epochs + 1):
        model.train() 
        total_loss = 0
        for x, y in train_loader:
                 
            # The train_loader will feed us a batch of data
            # stacked for the batch_size (number of samples)
# and provided separately as daily and quarterly
            # data. e.g., x[frequency_index][sample_index][2D index for feature/timestep]
            for tensor in x:
                tensor.requires_grad_(True)
            
            x = [tensor.to(device) for tensor in x]
            
            y = y.type(torch.float32)  # Labels were integers
            y = y.to(device)
    
            optimizer.zero_grad()
        
            # get predictions from model
            y_pred = model(x)
            
            # perform backprop
            loss = criterion(y_pred, y)
            loss.backward()
            optimizer.step()
            
total_loss += loss.item()
            
        print("Epoch: {}, Loss: {}".format(epoch, total_loss / len(train_loader)))

It is called as follows:

torch.manual_seed(42)

train_dl = torch.utils.data.DataLoader(train_ds, batch_size=5, shuffle=False)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
feature_count = train_ds.get_feature_count()
window_sizes = [189, 3]
model = Model(feature_count,window_sizes).to(device)

optimizer = optim.Adam(model.parameters(), lr=0.001)

loss_fn = torch.nn.BCELoss()

train(model, train_dl, 10, loss_fn, optimizer, device)
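To narrow this down, I also tried a quick gradient check after a single backward pass (a sketch using the objects defined above; if every grad stays None, I assume the graph is being cut somewhere):

x, y = next(iter(train_dl))
x = [tensor.to(device) for tensor in x]
y = y.type(torch.float32).to(device)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()

for name, param in model.named_parameters():
    # True for every parameter would mean no gradients reach the model
    print(name, param.grad is None)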

The output looks like this:

Epoch: 1, Loss: 0.25956406459516407
Epoch: 2, Loss: 0.25956406459516407
Epoch: 3, Loss: 0.25956406459516407
Epoch: 4, Loss: 0.25956406459516407
Epoch: 5, Loss: 0.25956406459516407
Epoch: 6, Loss: 0.25956406459516407
Epoch: 7, Loss: 0.25956406459516407
Epoch: 8, Loss: 0.25956406459516407
Epoch: 9, Loss: 0.25956406459516407
Epoch: 10, Loss: 0.25956406459516407

I have adjusted the batch size, learning rate, number of epochs, etc. These change the loss value, but it still does not move from one epoch to the next. The predictions themselves are also all suspiciously similar and do not appear to change between epochs.

I am not sure what I'm doing wrong here - any hints or advice would be appreciated.