I keep getting "index out of range in self" during forward pass

I am fine-tuning a Longformer Encoder-Decoder (LED) model for multi-document text summarization. When I run the forward pass, it raises the error "index out of range in self". The input shape seems to be correct, but the debugger points to something going wrong inside torch.nn.Embedding. How do I fix this?

num_epochs = 8
num_training_steps = num_epochs * len(train_dataloader)
optimizer = Adam(MODEL.parameters(), lr=3e-5)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=1, num_training_steps=num_training_steps # CHANGE LATER!!!!!!!
)
progress_bar = tqdm(range(num_training_steps))

# Training mode
MODEL.train()

for epoch in range(num_epochs):
  for batch_idx, batch in enumerate(train_dataloader):

    # Encode data
    input_ids_all = []
    for cluster in batch["document"]:
      articles = cluster.split("|||||")[:-1]
      for i, article in enumerate(articles):
        article = article.replace("\n", " ")
        article = " ".join(article.split())
        articles[i] = article
      input_ids = []
      for article in articles:
        input_ids.extend(TOKENIZER.encode(article, truncation=True, max_length=4096 // len(articles))[1:-1])
        input_ids.append(DOCSEP_TOKEN_ID)
      input_ids = ([TOKENIZER.bos_token_id]+input_ids+[TOKENIZER.eos_token_id])
      input_ids_all.append(torch.tensor(input_ids))
      input_ids = torch.nn.utils.rnn.pad_sequence(input_ids_all, batch_first=True, padding_value=PAD_TOKEN_ID)

    # Forward pass
    global_attention_mask = torch.zeros_like(input_ids)
    global_attention_mask[:, 0] = 1
    global_attention_mask[input_ids == DOCSEP_TOKEN_ID] = 1

    print(input_ids.shape)
    # outputs = MODEL.forward(input_ids) # <---------------------------------------------------------------------------------------------- causing a bug
    outputs = MODEL.forward(input_ids=input_ids_all, global_attention_mask=global_attention_mask)

    # Backprop
    loss = outputs.loss
    loss.backward()

    # GD
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
    progress_bar.update(1)

    # Decode output
    generated_str = TOKENIZER.batch_decode(generated_ids.tolist(), skip_special_tokens=True)
    metric.add_batch(predictions=generated_str, references=batch["summary"])

    # Calculate metrics
    print(f"Epoch: {epoch+1}, Batch: {batch_idx+1}:")
    print(metric.compute())

Check the min and max values of the input tensor that is passed to the embedding layer and make sure they are within the expected range [0, num_embeddings - 1].
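
For example, here is a minimal sketch of that check, assuming input_ids is the padded tensor built in your loop and MODEL is the Hugging Face model (both names taken from your snippet):

# Minimal sketch of the suggested check (assumes `input_ids` is the padded
# LongTensor from the snippet above and `MODEL` is the Hugging Face model)
embedding = MODEL.get_input_embeddings()   # the nn.Embedding the traceback points to
print("input min:", input_ids.min().item())
print("input max:", input_ids.max().item())
print("num_embeddings:", embedding.num_embeddings)

# every token id must lie in [0, num_embeddings - 1],
# otherwise nn.Embedding raises "index out of range in self"
assert input_ids.min() >= 0
assert input_ids.max() < embedding.num_embeddings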

My input is a tensor. What do you mean by min and max values? The input shape has two values, the batch size and the actual length of the sequence. How would I check the min and max values?

The shape of the input tensor is [10, 3047] and the embedding is (50266, 1024). Does the 3047 value need to be smaller than the 1024 value? Why did they set the max length to 4096 when the embedding is 1024?

No, the shape is irrelevant; you should check the values instead, e.g. via print(input.min(), input.max()).

The shape of (50266, 1024) implies that you have a vocabulary size of 50266 and you want to represent each word as an embedding vector of size 1024.
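
If you want to double-check that the 1024 is the embedding (hidden) size and not a maximum sequence length, you can inspect the model config. A sketch, assuming a Hugging Face LED checkpoint loaded into MODEL:

# Sketch, assuming a Hugging Face LED checkpoint loaded into MODEL
print(MODEL.config.d_model)      # embedding / hidden size, e.g. 1024
print(MODEL.config.vocab_size)   # vocabulary size the embedding matrix was built with
# maximum encoder input length supported by the checkpoint (e.g. 4096 or 16384)
print(MODEL.config.max_encoder_position_embeddings)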

Thus, you have to ensure that each word in a batch is represented by a word index in the range 0..50265, i.e. vocab_size - 1, where vocab_size corresponds to num_embeddings as mentioned in the previous post.
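
To illustrate the failure mode, here is a self-contained toy example (a plain nn.Embedding with the same shape, not your actual model) that reproduces the error as soon as a single index is out of range:

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=50266, embedding_dim=1024)

ok = torch.tensor([[0, 123, 50265]])   # all indices <= 50265 -> works
print(emb(ok).shape)                   # torch.Size([1, 3, 1024])

bad = torch.tensor([[0, 123, 50266]])  # 50266 is one past the last valid index
emb(bad)                               # IndexError: index out of range in self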