I’m working on a joint segmentation and classification problem, where I need to split a text into segments based on the emotions it expresses. The model has two outputs: a binary output denoting whether the current sentence is a segment boundary, and a multi-label sigmoid output giving the emotions/tags that apply to the latest segment.

I’m trying to partially replicate the model given in the following URL:

The input is fed as a sliding window of 7 sentences: a central sentence with left and right contexts of 3 sentences each. The model has to decide whether the central sentence is a segment boundary.
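As a minimal sketch of this windowing: the helper name `sliding_windows`, the random document tensor, and the choice to skip sentences that lack a full context on either side are all assumptions made for illustration, not part of the referenced model.

```python
import torch


def sliding_windows(sentences, context_size=3):
    """Yield (left, center, right) for every sentence that has a full context.

    `sentences` is assumed to have shape [num_sentences, 50, 300].
    Sentences closer than `context_size` to either end are skipped
    (an assumption; padding would be another option).
    """
    n = sentences.shape[0]
    for i in range(context_size, n - context_size):
        left = sentences[i - context_size:i]            # [3, 50, 300]
        center = sentences[i:i + 1]                     # [1, 50, 300]
        right = sentences[i + 1:i + 1 + context_size]   # [3, 50, 300]
        yield left, center, right


# Hypothetical document of 10 sentences, filled with random values
doc = torch.randn(10, 50, 300)
windows = list(sliding_windows(doc))
left, center, right = windows[0]
print(left.shape, center.shape, right.shape)
```

Slicing with `i:i + 1` rather than indexing with `i` keeps the leading dimension, so the central sentence has the same rank as the two contexts.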

The input sizes (left context, central sentence, right context) are:

```
torch.Size([3, 50, 300]) torch.Size([1, 50, 300]) torch.Size([3, 50, 300])
```

Since I’m feeding the windows one at a time, the batch size is 1. However, given the input shapes that the CNN and LSTM layers expect, I’m not sure how I should shape the input.

**Problem 1:**

The CNN layer (`nn.Conv2d`) expects a 4D tensor of shape [batch_size, num_channels, height, width]. I therefore reshaped the tensors to *[3, 1, 50, 300], [1, 1, 50, 300] and [3, 1, 50, 300]*, so the sentences of each context sit in the batch dimension.
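A minimal sketch of that reshape, using random data in place of the real inputs (the variable names are made up for illustration). It only checks the shapes going into and out of one `nn.Conv2d` layer with the same arguments as `cnn_k2` in the model below.

```python
import torch
import torch.nn as nn

# Random stand-in for the left context: 3 sentences of size [50, 300]
x_left = torch.randn(3, 50, 300)

# Add a channel dimension: [3, 50, 300] -> [3, 1, 50, 300]
x_left_4d = x_left.unsqueeze(1)

# Same arguments as cnn_k2: 1 input channel, 2 output channels,
# kernel covering 2 rows and the full width of 300
conv = nn.Conv2d(1, 2, kernel_size=(2, 300))
out = conv(x_left_4d)

# Height shrinks to 50 - 2 + 1 = 49, width to 300 - 300 + 1 = 1
print(x_left_4d.shape, out.shape)
```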

**Problem 2:**

The LSTM layer (with the default `batch_first=False`) expects input of shape [sequence_len, batch_size, input_size]. Before this step, I concatenate the encoder outputs into a tensor of shape *[7, 72]*. I then unsqueeze it to [7, 1, 72] and feed that to the LSTM.
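A minimal sketch of this step in isolation, with a random tensor standing in for the concatenated encoder output (the variable names are made up for illustration). The LSTM sizes and the zero initial state match the model below.

```python
import torch
import torch.nn as nn

# Random stand-in for the concatenated encoder output: 7 sentences, 72 features
encoded = torch.randn(7, 72)

# Insert a batch dimension of 1: [7, 72] -> [7, 1, 72]
lstm_in = encoded.unsqueeze(1)

lstm = nn.LSTM(72, 36)  # input_size=72, hidden_size=36, batch_first=False
hidden = (torch.zeros(1, 1, 36), torch.zeros(1, 1, 36))  # (h_0, c_0)

out, (h_n, c_n) = lstm(lstm_in, hidden)

# out holds one hidden vector per sentence; h_n and c_n hold the final state
print(out.shape, h_n.shape, c_n.shape)
```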

Here’s the model for reference:

```
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextEncoder(nn.Module):
    def __init__(self):
        super(ContextEncoder, self).__init__()
        # CNN filters for context
        self.cnn_k2 = nn.Conv2d(1, 2, kernel_size=(2, 300))
        self.cnn_k3 = nn.Conv2d(1, 2, kernel_size=(3, 300))
        self.cnn_k4 = nn.Conv2d(1, 2, kernel_size=(4, 300))

    def forward(self, x):
        # x: [N, 1, 50, 300]; each conv output is squeezed to [N, 2, 49].
        # All three branches use cnn_k2 here, so they stack to [3, N, 2, 49].
        x = torch.stack([torch.squeeze(self.cnn_k2(x), 3),
                         torch.squeeze(self.cnn_k2(x), 3),
                         torch.squeeze(self.cnn_k2(x), 3)])
        x = F.max_pool2d(x, 2)         # [3, N, 1, 24]
        return x.view(x.shape[1], -1)  # [N, 72]


class CNN_LSTM(nn.Module):
    def __init__(self, context_size):
        super(CNN_LSTM, self).__init__()
        # Context encoder
        self.context_encoder_lr = ContextEncoder()
        self.context_encoder_c = ContextEncoder()
        # Hidden state generation
        self.lstm = nn.LSTM(72, 36)
        # Attention generation
        self.attention = nn.Linear(36, 36)
        # Output layers
        self.emotions = nn.Linear((2 * context_size + 1) * 36, 14)
        self.segment = nn.Linear((2 * context_size + 1) * 36, 2)
        # Initialize hidden state
        self.reset_hidden()

    def reset_hidden(self):
        """Reset hidden state with (h_n, c_n)"""
        self.hidden = (torch.zeros(1, 1, 36), torch.zeros(1, 1, 36))

    def forward(self, x_left, x_center, x_right):
        # Encode context
        c_l = self.context_encoder_lr(x_left)    # [3, 72]
        c_c = self.context_encoder_c(x_center)   # [1, 72]
        c_r = self.context_encoder_lr(x_right)   # [3, 72]
        # Stack contexts and unsqueeze batch size
        c = torch.cat([c_l, c_c, c_r])           # [7, 72]
        c = c.unsqueeze(1)                       # [7, 1, 72]
        # Pass context through LSTM
        x, self.hidden = self.lstm(c, self.hidden)  # x: [7, 1, 36]
        # Generate attention
        attn = self.attention(x)
        attn = F.softmax(attn, dim=1)
        # Apply attention
        x = x * attn
        # Flatten layers
        x = x.view(-1)                           # [7 * 36]
        # Feed to output layers
        out_emotions = self.emotions(x)
        out_segment = self.segment(x)
        # torch.sigmoid replaces the deprecated F.sigmoid; dim=0 is the
        # dimension F.softmax picks implicitly for a 1D tensor
        return torch.sigmoid(out_emotions), F.softmax(out_segment, dim=0)
```

Are there any issues with the model or with how I’m shaping the inputs? If so, could you suggest the changes and explain the reasoning behind them?

Thanks in advance.