Best approach to training RNNs

I’m working on a language-modeling problem in PyTorch and have hit a few stuck points, listed below.

  1. I’m building the custom RNN below, translating what I’ve found in the literature into code. But the official PyTorch RNN tutorials don’t apply a tanh, and don’t compute the output y from the new hidden state. Why that implementation? (For comparison, I’ve put a sketch of the tutorial-style cell right after my code.)
import torch
import torch.nn as nn

class DetailedRNN(nn.Module):
    """
    PyTorch gives us the freedom to define custom models, so here we build
    our own RNN cell rather than using PyTorch's built-in one.
    """
    def __init__(self, input_size, hidden_size, output_size):
        
        super(DetailedRNN, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.tanh = nn.Tanh()
     
        # New hidden state is computed from the concatenated (input, hidden)
        self.i2h = nn.Linear(self.input_size + self.hidden_size, self.hidden_size)
        # Output is computed from the *new* hidden state
        self.i2o = nn.Linear(self.hidden_size, self.output_size)
        self.softmax = nn.LogSoftmax(dim=1)
    
    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden_layer = self.i2h(combined)
        hidden_layer = self.tanh(hidden_layer)
        output = self.i2o(hidden_layer)
        output = self.softmax(output)
        return output, hidden_layer
    
    def init_hidden(self):
        """
        Initialize the hidden state (a vanilla RNN has no cell state).
        """
        return torch.zeros(1, self.hidden_size)  # was: hidden_size (a NameError)
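
For reference, this is roughly the cell I mean from the official char-RNN tutorial, as I remember it (paraphrased from memory, so double-check against the actual tutorial): there is no tanh, and the output is computed from the concatenated (input, previous hidden) rather than from the new hidden state.

# Paraphrase (from memory) of the cell in the official PyTorch char-RNN
# tutorial -- not a verbatim copy
class TutorialStyleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(TutorialStyleRNN, self).__init__()
        self.hidden_size = hidden_size
        # Both layers read the concatenation of input and *previous* hidden state
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)   # new hidden state, no tanh applied
        output = self.i2o(combined)   # output from combined, not from new hidden
        return self.softmax(output), hidden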
  2. I’m training a character-level language model, but I don’t see any significant loss reduction as the iterations go by. The loss jumps up and down instead of following a typical loss curve. My training loop is below (with a gradient-clipping sketch after it, in case that’s relevant).
for epoch in range(num_epochs):
    
    random_lines = randomChunkGen(lines)
    num_steps = len(random_lines) // seq_length
    # Keep one extra character so every input has a next-character target
    random_lines = random_lines[:num_steps * seq_length + 1]
    
    for i in range(0, num_steps * seq_length, seq_length):
        
        # Inputs are seq_length characters; targets are the same window shifted by one
        input_line = random_lines[i:i+seq_length]
        inputs = lineToInputTensor(input_line, vocab_size).to(device)

        targets_line = random_lines[(i+1):(i+1)+seq_length]
        targets = lineToTargetTensor(targets_line).to(device)
        loss = 0

        # Fresh hidden state per sequence, so no graph is carried across sequences
        hidden_state = model.init_hidden().to(device)
        optimizer.zero_grad()
        
        for idx in range(len(input_line)):
            # Forward pass, one character at a time
            outputs, hidden_state = model(inputs[idx], hidden_state)
            loss += criterion(outputs, targets[idx])
        
        # Backward and optimize. Note this is the *summed* loss over the
        # sequence, not the average. retain_graph=True was unnecessary here,
        # since the hidden state is re-initialized for each sequence
        loss_list.append(float(loss))
        loss.backward()
        optimizer.step()
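
Not part of my original code, but since the loss bounces around: I’ve seen gradient clipping suggested for unstable RNN training. If I were to add it, it would go between backward() and step(), something like this (the max_norm value is arbitrary):

# Possible addition, not in my loop above: clip gradients to stabilize training
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()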
  3. Why is CrossEntropyLoss adamant about a LongTensor target? (A minimal call pattern showing what I mean is below. I also realize CrossEntropyLoss already combines LogSoftmax and NLLLoss, so with my model ending in LogSoftmax, NLLLoss may be the matching criterion.)
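# Minimal CrossEntropyLoss call pattern (the vocab_size value is just an example)
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
vocab_size = 57                       # example vocabulary size
logits = torch.randn(1, vocab_size)   # float scores, shape (N, C)
target = torch.tensor([3])            # class index, integer (Long) tensor, shape (N,)
loss = criterion(logits, target)      # works; a float tensor of indices does not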

  4. I’m using the following approach for training: I take a random chunk from the corpus, and for one epoch I traverse that chunk with a step size of seq_length. The loss is accumulated over an entire sequence. I’ve learned that an epoch is meant to be one pass through the entire dataset, but I switched to this scheme because I’ve seen official implementations do it this way. What is right or wrong here? (A sketch of what I understand a full-pass epoch to be follows.)
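For contrast, here is what I understand a “true” epoch (one full pass over the corpus) would look like, as a rough sketch; corpus is a stand-in name for the complete training text, and the per-sequence step is the same as in my loop above:

# Sketch: one epoch = one full pass over the whole corpus (corpus is a
# stand-in name for the complete training text)
for epoch in range(num_epochs):
    for i in range(0, len(corpus) - seq_length, seq_length):
        input_line = corpus[i:i + seq_length]
        targets_line = corpus[i + 1:i + 1 + seq_length]
        # ...same per-sequence forward/backward step as in my loop above...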

Any help from the experts here would be greatly appreciated :slight_smile: