Embeddings index out of range error

I’m not familiar with your code, but it seems you are trying to count the words that are used more than min_count times.
I guess this condition might reduce the number of unique words, so the indices produced from the original vocabulary no longer fit, and you are thus running into the error.
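For illustration, here is a minimal sketch of that failure mode; the words, counts, and min_count value are made up, since the original code isn’t shown:

import torch
import torch.nn as nn

# hypothetical word counts; the real code and min_count are not shown above
counts = {"the": 50, "cat": 3, "sat": 1, "mat": 1}
min_count = 2

# filtering rare words shrinks the vocabulary from 4 words to 2
vocab = [word for word, count in counts.items() if count >= min_count]

# an embedding sized from the filtered vocabulary ...
emb = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

# ... fails for an index produced from the unfiltered vocabulary
x = torch.tensor([3])
out = emb(x)
# IndexError: index out of range in self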

The error is generally raised when you initialize the embedding lookup table (the weight parameter in nn.Embedding) with a specific vocab_size, while the input tensor tries to index this lookup table at an invalid position.
If you want to use a word index of 8427, your nn.Embedding would need num_embeddings >= 8428, since valid indices are in [0, num_embeddings - 1].
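A quick way to verify this is to compare the largest index in your input against num_embeddings before the forward pass; a small sketch with made-up sizes:

import torch
import torch.nn as nn

# num_embeddings must be at least (largest index + 1)
emb = nn.Embedding(num_embeddings=8428, embedding_dim=300)

x = torch.tensor([0, 17, 8427])

# sanity check before the forward pass
assert x.max().item() <= emb.num_embeddings - 1

out = emb(x)  # works, since 8427 == num_embeddings - 1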

Thanks for your reply and guidance. I will look at my tokenizer; maybe the problem is there.
@ptrblck

@ptrblck Please help me out. I don’t know where the error is coming from. I was using LayoutLM for token classification.

This is my model configuration:

{
  "attention_probs_dropout_prob": 0.1,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "max_2d_position_embeddings": 1024,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pruned_heads": {},
  "torchscript": false,
  "type_vocab_size": 2,
  "use_bfloat16": false,
  "vocab_size": 30522
}

!python run_seq_labeling.py \
    --data_dir /content/drive/MyDrive/dataset_short/LLM/working/dataset \
    --labels /content/drive/MyDrive/dataset_short/LLM/working/dataset/labels.txt \
    --model_name_or_path "{pretrained_model_folder}" \
    --model_type layoutlm \
    --max_seq_length 512 \
    --do_lower_case \
    --do_train \
    --num_train_epochs 10 \
    --logging_steps 50 \
    --save_steps -1 \
    --output_dir output \
    --overwrite_output_dir \
    --per_gpu_train_batch_size 8 \
    --per_gpu_eval_batch_size 16

You are most likely running into the same error as already described in this topic, i.e. your input tensor contains indices that are out of bounds for the embedding layer. Did you read my previous answers, which explain which value range is expected?
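For a LayoutLM config like the one above, each embedding table defines its own valid index range, so it can help to print the maximum of every input tensor before the forward pass. A sketch, where the tensor names follow the Hugging Face LayoutLM signature and the random values are only placeholders for your real batch:

import torch

# placeholder batch; substitute the tensors your dataloader actually produces
input_ids = torch.randint(0, 30522, (8, 512))
bbox = torch.randint(0, 1000, (8, 512, 4))
token_type_ids = torch.zeros(8, 512, dtype=torch.long)

print(input_ids.max().item())       # must be <  vocab_size (30522)
print(input_ids.size(1))            # must be <= max_position_embeddings (512)
print(bbox.max().item())            # must be <  max_2d_position_embeddings (1024)
print(token_type_ids.max().item())  # must be <  type_vocab_size (2)

Unnormalized bounding boxes (pixel coordinates of 1024 or more) are a common way to exceed the max_2d_position_embeddings table in LayoutLM.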

@ptrblck I’m totally new to this. Can you say again what exactly I should do now, please? Is there any edit that can make it work? It was actually working fine before with fewer labels; as soon as I added one more label, it started showing this error.
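One way this symptom can appear: if labels.txt gains an entry but the classification head is still built for the old count (the config above shows "num_labels": 2), the loss itself raises an out-of-bounds IndexError. A minimal sketch of that mismatch, independent of the actual run_seq_labeling.py internals:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# head built when labels.txt had two entries
logits = torch.randn(4, 2)

# after adding a third label, the targets can contain the id 2
targets = torch.tensor([0, 1, 1, 2])

loss = criterion(logits, targets)
# IndexError: Target 2 is out of bounds.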

You have to check the range of the input tensor to the nn.Embedding layer and make sure its values are in [0, num_embeddings-1].
Here is another small example showing the IndexError:

import torch
import torch.nn as nn

# create embedding layer which expects inputs with indices in [0, num_embeddings-1]
num_embeddings = 10
embedding_dim = 5
emb = nn.Embedding(num_embeddings, embedding_dim)

# valid input since all indices are in [0, 9]
x = torch.tensor([0, 4, 9])
out = emb(x)

# invalid input as it contains an index which is out of bounds
x = torch.tensor([0, 4, 10])  # 10 is invalid!
out = emb(x)
# IndexError: index out of range in self

Embeddings use the input to index their weight as a lookup table. If the index is invalid, an error is raised.
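To make the lookup behaviour concrete, here is a small self-contained sketch showing that the embedding output is just a row selection from its weight matrix:

import torch
import torch.nn as nn

emb = nn.Embedding(10, 5)
x = torch.tensor([0, 4, 9])

# emb(x) returns the same rows as indexing the weight directly
out = emb(x)
manual = emb.weight[x]
print(torch.equal(out, manual))  # True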

Embedding layers expect integer inputs, not floats, so I don’t fully understand your statement.
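For completeness, passing float indices fails with a dtype error rather than the IndexError above (the exact message varies across PyTorch versions):

import torch
import torch.nn as nn

emb = nn.Embedding(10, 5)

x = torch.tensor([0.0, 4.0, 9.0])  # float values, even if whole numbers
out = emb(x)
# RuntimeError: embedding indices must be an integer (Long/Int) tensor,
# not a FloatTensor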

I was getting the error again and again, and after some time I found that I was using the wrong model every time. You are correct; I think I need to review my basics. Sorry for the wrong lead.