I am trying to modify this tutorial (Classifying Names with a Character-Level RNN) to enable mini-batch training.
I am familiar with pack_padded_sequence, but it only applies to the predefined RNN modules (e.g., nn.LSTM, nn.GRU), not to custom RNNs.
In this tutorial, they design an RNN manually, where input embedding is concatenated with previous hidden state to be fed into a linear layer, as shown in the following code.
def forward(self, input, hidden):
    combined = torch.cat((input, hidden), 1)
    hidden = self.i2h(combined)
    output = self.i2o(combined)
    output = self.softmax(output)
    return output, hidden
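For context, here is a minimal sketch of how such a module could be wrapped so that the same forward pass accepts a whole batch. The layer definitions and the sizes 57, 128 and 18 are my assumptions for illustration, and `init_hidden` with a `batch_size` argument is a hypothetical helper, not the tutorial's `initHidden`. The point is that the concatenation along dimension 1 only works if the hidden state has one row per sequence in the batch.

```python
import torch
import torch.nn as nn


class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        # Both layers read the concatenation [input, hidden].
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        # input: (batch, input_size), hidden: (batch, hidden_size)
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def init_hidden(self, batch_size=1):
        # Hypothetical helper: one zero hidden vector per sequence in the batch.
        return torch.zeros(batch_size, self.hidden_size)


# Illustrative sizes only (assumed, not taken from my actual setup).
rnn = RNN(input_size=57, hidden_size=128, output_size=18)
x = torch.zeros(3, 57)  # one time step for a batch of 3 sequences
output, hidden = rnn(x, rnn.init_hidden(batch_size=3))
print(output.shape, hidden.shape)
```

With a hidden state of shape (1, hidden_size) and a batch larger than 1, the `torch.cat` call would fail, so the initial hidden state has to match the batch size.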
Given people's names as sequences of characters, they feed the characters in one at a time, and once the last character has been fed into the model, they compute the loss between the output of the last time step and the ground truth (see below). For example, given the Korean name ‘ahn’, they feed ‘a’ -> ‘h’ -> ‘n’ and predict the label of ‘ahn’.
# This only handles a single sequence
def train(category_tensor, line_tensor):
    hidden = rnn.initHidden()
    rnn.zero_grad()
    # Feed the characters one by one; size()[0] is the sequence length
    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)
    loss = criterion(output, category_tensor)
    loss.backward()
The above tutorial feeds one name at a time. Now, I am trying to make this work on mini-batches.
I pad each sequence with 0’s up to the length of the longest sequence in the batch.
For example, [[1,2,3,4],[1,2,3],[1,2]] -> [[1,2,3,4],[1,2,3,0],[1,2,0,0]].
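The padding step described above can be sketched in plain Python as follows. `pad_batch` is a hypothetical helper name, and using 0 as the pad value assumes that 0 is not the index of a real character.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one.

    Returns the padded sequences together with the original lengths,
    which are needed later to find each sequence's last real step.
    """
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


padded, lengths = pad_batch([[1, 2, 3, 4], [1, 2, 3], [1, 2]])
print(padded)   # [[1, 2, 3, 4], [1, 2, 3, 0], [1, 2, 0, 0]]
print(lengths)  # [4, 3, 2]
```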
Below is the code that I wrote to handle a batch during training.
def train():
    ...
    for batch in batches:
        # Below code processes a batch
        hidden = rnn.initHidden()
        tmp = torch.empty((batch_size, n_categories))
        for ei in range(max_seq_len_of_this_batch):
            output, hidden = rnn(batch[:, ei], hidden)
            # batch_length is a list that contains the length of each sequence.
            # In the above example, batch_length = [4,3,2].
            idxs = np.argwhere((batch_length-1) == ei).ravel()
            tmp[idxs] = output[idxs]
        loss = criterion(tmp, y)
        totalLoss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    ...
In the above code, since I need to extract the last output of each sequence, I create another tensor, tmp, to store the output vector once a sequence reaches its last character. After collecting the last output of each sequence in tmp, I compute the loss and do the standard backward pass and optimizer step.
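To make the intent of tmp concrete, here is a minimal sketch of what I expect it to contain. `step_outputs` is a hypothetical stand-in (random values) for the per-step outputs of the RNN, stacked into shape (max_len, batch_size, n_categories), and `lengths` plays the role of batch_length. The loop mirrors my training code, and the single indexing expression is an equivalent way to pick the same rows, assuming all step outputs are kept.

```python
import torch

torch.manual_seed(0)
batch_size, max_len, n_categories = 3, 4, 5

# Hypothetical stand-in for the RNN outputs at every time step.
step_outputs = torch.randn(max_len, batch_size, n_categories)
lengths = torch.tensor([4, 3, 2])  # true length of each padded sequence

# Loop-and-copy, as in my training code: at step t, keep the output of
# every sequence whose last real character is at position t.
last = torch.empty(batch_size, n_categories)
for t in range(max_len):
    idxs = (lengths - 1 == t).nonzero(as_tuple=True)[0]
    last[idxs] = step_outputs[t][idxs]

# Equivalent selection with advanced indexing: for sequence b, take the
# output at time step lengths[b] - 1.
last_indexed = step_outputs[lengths - 1, torch.arange(batch_size)]

print(torch.equal(last, last_indexed))
```

Either way, row b of the result should be the output at the last non-padded step of sequence b, so outputs computed on padding positions never reach the loss.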
However, the accuracy I obtain is much lower than I expected. Am I doing something wrong?