Issue with tensor shapes for NLP

I’m working with a segmentation-classification problem statement, where I need to split some text into segments based on its emotions. I have two outputs here: a binary output denoting whether the current sentence is a segment boundary or not, and a multi-label sigmoid output that produces the list of emotions/tags that apply for the latest segment.

I’m trying to partially replicate the model given in the following URL:

The input is fed as a sliding window of size 7, with left and right contexts of a width of 3 sentences each. The central sentence is to be identified as a segment boundary.
The input sizes are:

torch.Size([3, 50, 300]) torch.Size([1, 50, 300]) torch.Size([3, 50, 300])

Since I’m feeding the elements individually, the batch size is 1. However, due to the input specifications of the CNN and LSTM layers, I am unable to correctly identify how I should feed the input.

Problem 1:
In the CNN layer, the input must be a 4D array of size [batch_size, num_channels, height, width]. I therefore reshaped the arrays to sizes [3, 1, 50, 300], [1, 1, 50, 300] and [3, 1, 50, 300].

Problem 2:
The LSTM layer requires the shape [sequence_len, batch_size, input_size]. In a step before this, I had concatenated the previous output to give a shape of [7,72]. I unsqueezed this output to get a shape [7,1,72] and fed it that.

Here’s the model for reference:

class ContextEncoder(nn.Module):
    def __init__(self):
        super(ContextEncoder, self).__init__()
        
        # CNN filters for context
        self.cnn_k2 = nn.Conv2d(1,2,kernel_size=(2,300))
        self.cnn_k3 = nn.Conv2d(1,2,kernel_size=(3,300))
        self.cnn_k4 = nn.Conv2d(1,2,kernel_size=(4,300))
        
    def forward(self, x):
        x = torch.stack([torch.squeeze(self.cnn_k2(x), 3), torch.squeeze(self.cnn_k2(x), 3), torch.squeeze(self.cnn_k2(x), 3)])
        x = F.max_pool2d(x, 2)
        return x.view(x.shape[1], -1)

class CNN_LSTM(nn.Module):
    def __init__(self, context_size):
        super(CNN_LSTM, self).__init__()
        
        # Context encoder
        self.context_encoder_lr = ContextEncoder()
        self.context_encoder_c  = ContextEncoder()
        
        # Hidden state generation
        self.lstm = nn.LSTM(72, 36)
        
        # Attention generation
        self.attention = nn.Linear(36, 36)
        
        # Output layers
        self.emotions = nn.Linear((2 * context_size + 1) * 36, 14)
        self.segment = nn.Linear((2 * context_size + 1) * 36, 2)
        
        # Initialize hidden state
        self.reset_hidden()
    
    def reset_hidden(self):
        """Reset hidden state with (h_n, c_n)"""
        self.hidden = (torch.zeros(1,1,36), torch.zeros(1,1,36))
        
    def forward(self, x_left, x_center, x_right):
        # Encode context
        c_l = self.context_encoder_lr(x_left)
        c_c = self.context_encoder_c(x_center)
        c_r = self.context_encoder_lr(x_right)
        
        # Stack contexts and unsqueeze batch size
        c = torch.cat([c_l, c_c, c_r])
        c = c.unsqueeze(1)
        
        # Pass context through LSTM
        x, self.hidden = self.lstm(c, self.hidden)
        
        # Generate attention
        attn = self.attention(x)
        attn = F.softmax(attn, dim=1)
        
        # Apply attention
        x = x * attn
        
        # Flatten layers
        x = x.view(-1)
        
        # Feed to output layers
        out_emotions = self.emotions(x)
        out_segment = self.segment(x)
        
        return F.sigmoid(out_emotions), F.softmax(out_segment)

Are there any issues with the model? If so, can you suggest the changes and the logic?
Thanks in advance.