Problem with pack_sequence LSTM classifier

I want to train a model to predict comma insertion in text. I'm using 50-dimensional word embeddings as feature inputs and a binary comma feature (0/1) as the target for each word/time step. Since sentences have different lengths and I want to do minibatching, I'm using pack_sequence on input sorted by length.

Now to my problem. The model seems to converge in the sense that the loss is decreasing, but the predictions seem arbitrary, or rather shift towards predicting fewer commas with each epoch. I have tried the same setup without packing the input data, just training on one sentence at a time, and that works well, so I'm fairly sure my current problem has to do with how I pack the data or update the parameters. Below is a minimum working example of the code (excluding the training data). As you can see, I'm using NLLLoss and SGD. It's a bit verbose, but I wanted it to be clear what I'm doing.

import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.nn.utils.rnn import pack_sequence
import sys

torch.manual_seed(1)

class CommaPredictor(nn.Module):
    def __init__(self, CONFIG):
        super(CommaPredictor, self).__init__()
        self.in_dim = CONFIG['IN_DIM']
        self.hidden_dim = CONFIG['HIDDEN_DIM']
        self.batch_size = CONFIG['BATCH_SIZE']
        # single-layer LSTM over the 50-dim word embeddings
        self.lstm = nn.LSTM(self.in_dim,
                            self.hidden_dim)
        # two output classes: 0 = no comma, 1 = comma
        self.output_layer = nn.Linear(self.hidden_dim, 2)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # (h_0, c_0), each of shape (num_layers, batch_size, hidden_dim)
        return (Variable(torch.zeros(1, self.batch_size, self.hidden_dim).cuda()),
                Variable(torch.zeros(1, self.batch_size, self.hidden_dim).cuda()))

    def forward(self, x_packed):
        # lstm_out is a PackedSequence; lstm_out[0] is its flattened .data
        # tensor of shape (total_timesteps_in_batch, hidden_dim)
        lstm_out, self.hidden = self.lstm(x_packed, self.hidden)
        out_space = self.output_layer(lstm_out[0])
        out_scores = F.log_softmax(out_space, dim=1)
        return out_scores


CONFIG = {'IN_DIM':50,
          'HIDDEN_DIM':256,
          'BATCH_SIZE':5,
          'LEARNING_RATE':0.01}

model = CommaPredictor(CONFIG)
model.cuda()
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=CONFIG['LEARNING_RATE'])

for epoch in range(10):
    total_loss = 0
    for iteration in range(len(train_seqs) // CONFIG['BATCH_SIZE']):
        minibatch_seqs = train_seqs[iteration * CONFIG['BATCH_SIZE']:(iteration + 1) * CONFIG['BATCH_SIZE']]

        # pack_sequence expects the sequences sorted by decreasing length
        minibatch_sorted_seqs = sorted(minibatch_seqs, key=lambda x: len(x), reverse=True)
        minibatch_tensor_seqs = [torch.FloatTensor(seq).cuda() for seq in minibatch_sorted_seqs]
        # columns 0..49 are the word embedding, the last column is the 0/1 comma target
        packed_seqs = pack_sequence([seq[:, :-1] for seq in minibatch_tensor_seqs])
        packed_targets = pack_sequence([seq[:, -1].long() for seq in minibatch_tensor_seqs])
            
        model.zero_grad()
        optimizer.zero_grad()
        # reset the hidden state for every minibatch
        model.hidden = model.init_hidden()
        # y_pred: (total_timesteps_in_batch, 2) log-probabilities
        y_pred = model(packed_seqs)
        # packed_targets[0] is the flattened target data, in the same order as y_pred
        loss = loss_function(y_pred, packed_targets[0])
        loss.backward()
        optimizer.step()
        total_loss += loss.data
    sys.stderr.write('%f\n' % float(total_loss))

train_seqs is a nested list of the form data -> sentence -> word -> [50-dim word embedding, binary comma feature], i.e. each word is represented by 51 values. For each minibatch I sort the sentences by length, build a FloatTensor of the word embeddings and a LongTensor of the corresponding per-word classes, and pack both.
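To make the layout and the packing concrete, here is a toy version of what I mean (same imports as above; the values are made up and the embedding dimension is 3 instead of 50):

# Toy illustration only: two "sentences", each word is [embedding..., comma_flag].
toy_train_seqs = [
    [[0.1, 0.2, 0.3, 0],    # sentence 1, word 1, no comma
     [0.4, 0.5, 0.6, 1]],   # sentence 1, word 2, comma
    [[0.7, 0.8, 0.9, 0],    # sentence 2, word 1, no comma
     [1.0, 1.1, 1.2, 0],    # sentence 2, word 2, no comma
     [1.3, 1.4, 1.5, 1]],   # sentence 2, word 3, comma
]

toy_sorted = sorted(toy_train_seqs, key=lambda x: len(x), reverse=True)
toy_tensors = [torch.FloatTensor(seq) for seq in toy_sorted]
toy_packed_x = pack_sequence([seq[:, :-1] for seq in toy_tensors])
toy_packed_y = pack_sequence([seq[:, -1].long() for seq in toy_tensors])
print(toy_packed_x.data.size())  # torch.Size([5, 3]): all 5 words, interleaved by time step
print(toy_packed_y.data)         # [0, 0, 0, 1, 1]: t0 of both sentences, then t1 of both, then t2 of the longer one

Since both pack_sequence calls see the sentences in the same sorted order, my understanding is that toy_packed_x.data[i] and toy_packed_y.data[i] always refer to the same word, which is what the loss computation above relies on.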

Now, I'm quite new to PyTorch and machine learning in general, so I may have gotten something completely wrong here, but since a similar setup feeding one sentence at a time gives reasonable results after a few epochs, I do think the general idea should work.
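For reference, the single-sentence version that works for me looks roughly like this (condensed from memory, so treat it as a sketch: the model there is built with 'BATCH_SIZE': 1 and the linear layer is applied to the reshaped, unpacked LSTM output instead of the packed data):

# Sketch of the per-sentence setup (no packing); same class, loss and optimizer
# as above, but CONFIG['BATCH_SIZE'] is 1.
for seq in train_seqs:
    seq_tensor = torch.FloatTensor(seq).cuda()
    x = Variable(seq_tensor[:, :-1].unsqueeze(1))         # (seq_len, 1, 50)
    y = Variable(seq_tensor[:, -1].long())                # (seq_len,)

    optimizer.zero_grad()
    model.hidden = model.init_hidden()
    lstm_out, model.hidden = model.lstm(x, model.hidden)  # (seq_len, 1, hidden_dim)
    out_space = model.output_layer(lstm_out.view(len(seq), -1))
    y_pred = F.log_softmax(out_space, dim=1)              # (seq_len, 2)
    loss = loss_function(y_pred, y)
    loss.backward()
    optimizer.step()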

To summarize: my loss is decreasing, but when I check the predicted classes with y_pred.max(1)[1] after each epoch, the number of class-0 predictions (no comma) keeps increasing and the number of class-1 predictions (comma) keeps decreasing.
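Concretely, this is roughly how I count them (using y_pred and packed_targets from the last minibatch of the epoch; variable names as in the code above):

# Count predicted classes: 0 = no comma, 1 = comma.
predicted = y_pred.max(1)[1].data
n_no_comma = int((predicted == 0).sum())
n_comma = int((predicted == 1).sum())
n_comma_gold = int((packed_targets[0] == 1).sum())
sys.stderr.write('predicted: %d commas, %d non-commas (gold commas: %d)\n'
                 % (n_comma, n_no_comma, n_comma_gold))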

I would be very thankful if someone could help me out here!