Embeddings index out of range error

I’m not familiar with your code, but it seems you are trying to count the words that are used more than min_count times.
I guess this condition might reduce the number of unique words, so the indices produced from the original vocabulary no longer fit, and you are thus running into the error.
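For illustration, here is a minimal sketch of that failure mode; the words, counts, and min_count value are made up, since the original code isn’t shown:

import torch
import torch.nn as nn

# hypothetical word counts; the real code and min_count are not shown above
counts = {"the": 50, "cat": 3, "sat": 1, "mat": 1}
min_count = 2

# filtering rare words shrinks the vocabulary from 4 words to 2
vocab = [word for word, count in counts.items() if count >= min_count]

# an embedding sized from the filtered vocabulary ...
emb = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

# ... fails for an index produced from the unfiltered vocabulary
x = torch.tensor([3])
out = emb(x)
# IndexError: index out of range in self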

The error is generally raised when you initialize the embedding lookup table (the weight parameter in nn.Embedding) with a specific vocab_size, while the input tensor tries to index this lookup table at an invalid position.
If you want to use a word index of 8427, your nn.Embedding would need num_embeddings >= 8428, since valid indices are in [0, num_embeddings - 1].
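A quick way to verify this is to compare the largest index in your input against num_embeddings before the forward pass; a small sketch with made-up sizes:

import torch
import torch.nn as nn

# num_embeddings must be at least (largest index + 1)
emb = nn.Embedding(num_embeddings=8428, embedding_dim=300)

x = torch.tensor([0, 17, 8427])

# sanity check before the forward pass
assert x.max().item() <= emb.num_embeddings - 1

out = emb(x)  # works, since 8427 == num_embeddings - 1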

Thanks for your reply and guidance. I will look at my tokenizer; maybe the problem is there.
@ptrblck

@ptrblck Please help me out. I don’t know where the error is coming from. I was using LayoutLM for token classification.

This is my model configuration:

{
  "attention_probs_dropout_prob": 0.1,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "max_2d_position_embeddings": 1024,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pruned_heads": {},
  "torchscript": false,
  "type_vocab_size": 2,
  "use_bfloat16": false,
  "vocab_size": 30522
}

!python run_seq_labeling.py \
    --data_dir /content/drive/MyDrive/dataset_short/LLM/working/dataset \
    --labels /content/drive/MyDrive/dataset_short/LLM/working/dataset/labels.txt \
    --model_name_or_path "{pretrained_model_folder}" \
    --model_type layoutlm \
    --max_seq_length 512 \
    --do_lower_case \
    --do_train \
    --num_train_epochs 10 \
    --logging_steps 50 \
    --save_steps -1 \
    --output_dir output \
    --overwrite_output_dir \
    --per_gpu_train_batch_size 8 \
    --per_gpu_eval_batch_size 16

You are most likely running into the same error as already described in this topic, i.e. your input tensor contains indices that are out of bounds for the embedding layer. Did you read my previous answers, which explain which value range is expected?
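For a LayoutLM config like the one above, each embedding table defines its own valid index range, so it can help to print the maximum of every input tensor before the forward pass. A sketch, where the tensor names follow the Hugging Face LayoutLM signature and the random values are only placeholders for your real batch:

import torch

# placeholder batch; substitute the tensors your dataloader actually produces
input_ids = torch.randint(0, 30522, (8, 512))
bbox = torch.randint(0, 1000, (8, 512, 4))
token_type_ids = torch.zeros(8, 512, dtype=torch.long)

print(input_ids.max().item())       # must be <  vocab_size (30522)
print(input_ids.size(1))            # must be <= max_position_embeddings (512)
print(bbox.max().item())            # must be <  max_2d_position_embeddings (1024)
print(token_type_ids.max().item())  # must be <  type_vocab_size (2)

Unnormalized bounding boxes (pixel coordinates of 1024 or more) are a common way to exceed the max_2d_position_embeddings table in LayoutLM.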

@ptrblck I’m totally new to this. Can you say again what exactly I should do now, please? Is there any edit that can make it work? It was actually working fine before with fewer labels; as soon as I added one more label, it started showing this error.
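One way this symptom can appear: if labels.txt gains an entry but the classification head is still built for the old count (the config above shows "num_labels": 2), the loss itself raises an out-of-bounds IndexError. A minimal sketch of that mismatch, independent of the actual run_seq_labeling.py internals:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# head built when labels.txt had two entries
logits = torch.randn(4, 2)

# after adding a third label, the targets can contain the id 2
targets = torch.tensor([0, 1, 1, 2])

loss = criterion(logits, targets)
# IndexError: Target 2 is out of bounds.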

You have to check the range of the input tensor to the nn.Embedding layer and make sure its values are in [0, num_embeddings-1].
Here is another small example showing the IndexError:

import torch
import torch.nn as nn

# create embedding layer which expects inputs with indices in [0, num_embeddings-1]
num_embeddings = 10
embedding_dim = 5
emb = nn.Embedding(num_embeddings, embedding_dim)

# valid input since all indices are in [0, 9]
x = torch.tensor([0, 4, 9])
out = emb(x)

# invalid input as it contains an index which is out of bounds
x = torch.tensor([0, 4, 10])  # 10 is invalid!
out = emb(x)
# IndexError: index out of range in self

Embeddings use the input to index their weight as a lookup table. If the index is invalid, an error is raised.
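To make the lookup behaviour concrete, here is a small self-contained sketch showing that the embedding output is just a row selection from its weight matrix:

import torch
import torch.nn as nn

emb = nn.Embedding(10, 5)
x = torch.tensor([0, 4, 9])

# emb(x) returns the same rows as indexing the weight directly
out = emb(x)
manual = emb.weight[x]
print(torch.equal(out, manual))  # True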

Embedding layers expect integer inputs, not floats, so I don’t fully understand your statement.
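For completeness, passing float indices fails with a dtype error rather than the IndexError above (the exact message varies across PyTorch versions):

import torch
import torch.nn as nn

emb = nn.Embedding(10, 5)

x = torch.tensor([0.0, 4.0, 9.0])  # float values, even if whole numbers
out = emb(x)
# RuntimeError: embedding indices must be an integer (Long/Int) tensor,
# not a FloatTensor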

I was getting the error again and again, and after some time I found that I was using the wrong model every time. You are correct; I think I need to review my basics. Sorry for the wrong lead.