Error in LSTM-based RNN training: "IndexError: index out of range" during DataLoader iteration

I’m currently working on training an LSTM-based RNN using PyTorch and encountering an error during the training loop. I’ve tried various solutions to resolve this issue, but I’m still facing the same error message. Here’s the relevant code snippet:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pad_sequence
import pandas as pd
import json

class RNNModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(RNNModel, self).__init__()
        # Embedding -> LSTM -> linear projection back to the vocabulary
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        embedded = self.embedding(x)
        output, (h_n, c_n) = self.rnn(embedded)
        return self.fc(output)

# Load tokenized vocabulary
with open('cleaned_vocab.json', 'r', encoding='utf-8') as vocab_file:
    vocab = json.load(vocab_file)

# Load processed data from CSV
class CustomDataset(Dataset):
    def __init__(self, csv_path, max_seq_length):
        self.data = pd.read_csv(csv_path)
        self.max_seq_length = max_seq_length
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        text = self.data.loc[idx, 'text']
        tokens = [int(token) for token in text.split()]
        
        if len(tokens) > self.max_seq_length:
            tokens = tokens[:self.max_seq_length]
        
        padded_sequence = tokens + [0] * (self.max_seq_length - len(tokens))
        input_sequence = torch.tensor(padded_sequence[:-1])  # Input sequence without last token
        target_sequence = torch.tensor(padded_sequence[1:])   # Target sequence without first token
        return input_sequence, target_sequence

# Custom collate function
class CustomCollate:
    def __init__(self, pad_idx):
        self.pad_idx = pad_idx
    
    def __call__(self, batch):
        input_seqs, target_seqs = zip(*batch)
        padded_input_seqs = pad_sequence(input_seqs, batch_first=True, padding_value=self.pad_idx)
        padded_target_seqs = pad_sequence(target_seqs, batch_first=True, padding_value=self.pad_idx)
        return padded_input_seqs, padded_target_seqs

# Initialize custom dataset
max_sequence_length = 30  # Define your desired maximum sequence length
dataset = CustomDataset('processed_data.csv', max_sequence_length)

# Create a dataloader with custom collate function
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=CustomCollate(0))

# Initialize the RNN model
vocab_size = len(vocab)
embedding_dim = 128
hidden_dim = 256
rnn_model = RNNModel(vocab_size, embedding_dim, hidden_dim)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(rnn_model.parameters(), lr=0.001)

# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    for input_batch, target_batch in dataloader:
        optimizer.zero_grad()

        # Forward pass
        output = rnn_model(input_batch)

        # Calculate loss and backpropagate
        loss = criterion(output.transpose(1, 2), target_batch)
        loss.backward()
        optimizer.step()

# Save the trained model
torch.save(rnn_model.state_dict(), 'rnn_model.pth')

print("Training completed.")

I’ve verified my CSV file, adjusted the ‘max_seq_length’ parameter to ensure it is appropriate for my data, and double-checked the data pre-processing steps, including padding and formatting.

Any suggestions on how to further debug and resolve this issue would be greatly appreciated. Thank you in advance!

It’s unclear where exactly the error is raised from, but I assume it’s in the Dataset.__getitem__ method.
Try to narrow down the failing line of code: print the shape of the object you want to index as well as the index value to see why this indexing operation fails.
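
For example, something along these lines (a minimal sketch reusing the dataset object from your snippet) iterates over the samples directly, without the DataLoader, and prints the index, shapes, and value range until one of them fails:

# Debugging sketch: iterate over the raw dataset and print the index,
# shapes, and value range of each sample until one of them fails.
for idx in range(len(dataset)):
    try:
        input_seq, target_seq = dataset[idx]
        print(idx, input_seq.shape, target_seq.shape,
              input_seq.min().item(), input_seq.max().item())
    except Exception as e:
        print(f"Sample {idx} failed: {e}")
        break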

Sorry for making it unclear. It’s at line 19, embedded = self.embedding(x), and line 90, output = rnn_model(input_batch).

Hope this helps more, and thank you for helping.

Ah OK, in this case the same applies: check the input_batch’s min/max values and make sure they are in [0, num_embeddings-1].
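
A quick check along these lines (a sketch reusing the dataloader and rnn_model from your snippet) should flag any batch that is out of range:

# Sketch: verify every batch stays inside the embedding's valid index range.
num_embeddings = rnn_model.embedding.num_embeddings
for step, (input_batch, target_batch) in enumerate(dataloader):
    lo, hi = input_batch.min().item(), input_batch.max().item()
    if lo < 0 or hi >= num_embeddings:
        print(f"Batch {step}: indices outside [0, {num_embeddings - 1}]: min={lo}, max={hi}")
        break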

I followed your advice and checked the input_batch's min/max values, which are well within the [0, num_embeddings-1] range. My vocabulary has indices ranging from 0 to 50258, and I’ve ensured that the indices in the input_batch tensor fall within this range.

I appreciate your help

In that case this operation won’t fail and you would need to narrow down which op really failed.

Could you give me reasons as to why it would have failed?

The embedding layer will fail with the error message if the input contains values out of bounds as seen here:

emb = nn.Embedding(10, 10)
x = torch.tensor([10])
out = emb(x)
# IndexError: index out of range in self

However, you claim that it’s not the case, so another layer must fail, which you would still need to isolate, or your check is wrong and the input does indeed contain invalid indices in another iteration.
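
One way to isolate it (a rough sketch using the rnn_model and dataloader from your original snippet) is to call the sub-modules one at a time so the traceback points directly at the failing layer:

# Sketch: run the sub-modules step by step instead of the full forward pass.
input_batch, target_batch = next(iter(dataloader))
embedded = rnn_model.embedding(input_batch)  # raises IndexError here if any index >= num_embeddings
output, (h_n, c_n) = rnn_model.rnn(embedded)
logits = rnn_model.fc(output)
print(embedded.shape, output.shape, logits.shape)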

Thanks for the help. I was using a vocab file from the internet and wanted to add my own words, however it would not work after removing the training process was completed. Do you know how I could add my own words?
Thanks once again.

I’m not sure I understand the use case entirely and don’t know what “would not work after removing the training process was completed” means. In case you’ve added more word indices without expanding the embedding layer, the error would be expected, as seen in my code snippet.

Sorry, there was a misunderstanding due to my poor wording. What I meant is that the terminal would say “training process completed” when there are no errors. I was using a file which contained vocab mapped to tokens, and I had one dataset file with random data and another dataset file which was the original data but tokenized. When I added new words to the vocab file they would be tokenized properly, but once I ran the code I experienced this issue; upon removing these added words the code worked properly. What I wanted to know is how I can add my own words without facing this issue.

You would need to expand the embedding layer by increasing its num_embeddings argument to the new number of words.
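
As a rough sketch (assuming you want to keep the rows already learned for the original vocabulary; num_new_words is a placeholder for however many words you added), you could replace the embedding and output layers with larger ones:

# Sketch: grow the embedding layer to the new vocabulary size while
# keeping the already-trained rows; the output layer must grow as well.
old_emb = rnn_model.embedding
new_vocab_size = old_emb.num_embeddings + num_new_words  # num_new_words: placeholder
new_emb = nn.Embedding(new_vocab_size, old_emb.embedding_dim)

with torch.no_grad():
    new_emb.weight[:old_emb.num_embeddings] = old_emb.weight  # copy trained rows

rnn_model.embedding = new_emb
rnn_model.fc = nn.Linear(rnn_model.fc.in_features, new_vocab_size)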

Thank you so much for your help