I’m working with a segmentation-classification problem statement, where I need to split some text into segments based on its emotions. I have two outputs here: a binary output denoting whether the current sentence is a segment boundary or not, and a multi-label sigmoid output that produces the list of emotions/tags that apply for the latest segment.
I’m trying to partially replicate the model given in the following URL:
The input is fed as a sliding window of size 7, with left and right contexts of a width of 3 sentences each. The central sentence is to be identified as a segment boundary.
The input sizes are:
torch.Size([3, 50, 300]) torch.Size([1, 50, 300]) torch.Size([3, 50, 300])
Since I’m feeding the elements individually, the batch size is 1. However, due to the input specifications of the CNN and LSTM layers, I am unable to correctly identify how I should feed the input.
In the CNN layer, the input must be a 4D array of size [batch_size, num_channels, height, width]. I therefore reshaped the arrays to sizes [3, 1, 50, 300], [1, 1, 50, 300] and [3, 1, 50, 300].
The LSTM layer requires the shape [sequence_len, batch_size, input_size]. In a step before this, I had concatenated the previous output to give a shape of [7,72]. I unsqueezed this output to get a shape [7,1,72] and fed it that.
Here’s the model for reference:
class ContextEncoder(nn.Module): def __init__(self): super(ContextEncoder, self).__init__() # CNN filters for context self.cnn_k2 = nn.Conv2d(1,2,kernel_size=(2,300)) self.cnn_k3 = nn.Conv2d(1,2,kernel_size=(3,300)) self.cnn_k4 = nn.Conv2d(1,2,kernel_size=(4,300)) def forward(self, x): x = torch.stack([torch.squeeze(self.cnn_k2(x), 3), torch.squeeze(self.cnn_k2(x), 3), torch.squeeze(self.cnn_k2(x), 3)]) x = F.max_pool2d(x, 2) return x.view(x.shape, -1) class CNN_LSTM(nn.Module): def __init__(self, context_size): super(CNN_LSTM, self).__init__() # Context encoder self.context_encoder_lr = ContextEncoder() self.context_encoder_c = ContextEncoder() # Hidden state generation self.lstm = nn.LSTM(72, 36) # Attention generation self.attention = nn.Linear(36, 36) # Output layers self.emotions = nn.Linear((2 * context_size + 1) * 36, 14) self.segment = nn.Linear((2 * context_size + 1) * 36, 2) # Initialize hidden state self.reset_hidden() def reset_hidden(self): """Reset hidden state with (h_n, c_n)""" self.hidden = (torch.zeros(1,1,36), torch.zeros(1,1,36)) def forward(self, x_left, x_center, x_right): # Encode context c_l = self.context_encoder_lr(x_left) c_c = self.context_encoder_c(x_center) c_r = self.context_encoder_lr(x_right) # Stack contexts and unsqueeze batch size c = torch.cat([c_l, c_c, c_r]) c = c.unsqueeze(1) # Pass context through LSTM x, self.hidden = self.lstm(c, self.hidden) # Generate attention attn = self.attention(x) attn = F.softmax(attn, dim=1) # Apply attention x = x * attn # Flatten layers x = x.view(-1) # Feed to output layers out_emotions = self.emotions(x) out_segment = self.segment(x) return F.sigmoid(out_emotions), F.softmax(out_segment)
Are there any issues with the model? If so, can you suggest the changes and the logic?
Thanks in advance.