AssertionError: Padding_idx must be within num_embeddings

Hi there, need your help!

I was using some older transformers code, and it had been working fine for the last two months until it suddenly started giving this error in a place that never changed:

AssertionError                            Traceback (most recent call last)
<ipython-input-31-c365f437b895> in <module>()
      9 tokenizer = tokenizer_class.from_pretrained(args['model_name'])
---> 11 model = model_class.from_pretrained(args['model_name'])

3 frames
/usr/local/lib/python3.6/dist-packages/pytorch_transformers/ in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    535         # Instantiate model.
--> 536         model = cls(config, *model_args, **model_kwargs)
    538         if state_dict is None and not from_tf:

/usr/local/lib/python3.6/dist-packages/pytorch_transformers/ in __init__(self, config)
    842         self.num_labels = config.num_labels
--> 844         self.transformer = XLMModel(config)
    845         self.sequence_summary = SequenceSummary(config)

/usr/local/lib/python3.6/dist-packages/pytorch_transformers/ in __init__(self, config)
    543         if config.n_langs > 1 and config.use_lang_emb:
    544             self.lang_embeddings = nn.Embedding(self.n_langs, self.dim)
--> 545         self.embeddings = nn.Embedding(self.n_words, self.dim, padding_idx=self.pad_index)
    546         self.layer_norm_emb = nn.LayerNorm(self.dim, eps=config.layer_norm_eps)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/ in __init__(self, num_embeddings, embedding_dim, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse, _weight)
     86         if padding_idx is not None:
     87             if padding_idx > 0:
---> 88                 assert padding_idx < self.num_embeddings, 'Padding_idx must be within num_embeddings'
     89             elif padding_idx < 0:
     90                 assert padding_idx >= -self.num_embeddings, 'Padding_idx must be within num_embeddings'

AssertionError: Padding_idx must be within num_embeddings

The error message points to padding_idx, which seems to be greater than or equal to num_embeddings.
From the docs:

If given, pads the output with the embedding vector at padding_idx (initialized to zeros) whenever it encounters the index.

This means that padding_idx has to be a valid index into the embedding table, i.e. smaller than num_embeddings.
How did you define num_embeddings and what are you using for padding_idx?
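
To make the condition concrete, here is a minimal sketch that mirrors the two assertions visible in the traceback above. The function name check_padding_idx and the example sizes are made up for illustration; they are not part of PyTorch or of the XLM config.

```python
def check_padding_idx(num_embeddings, padding_idx):
    """Return True if padding_idx is a valid row of an embedding table.

    Hypothetical helper: it repeats the bounds check shown in the
    traceback (nn.Embedding.__init__) without needing torch.
    """
    if padding_idx is None:
        return True  # no padding row requested, nothing to check
    if padding_idx > 0:
        return padding_idx < num_embeddings
    if padding_idx < 0:
        # negative values count from the end, like Python list indices
        return padding_idx >= -num_embeddings
    return True  # padding_idx == 0 passes the check shown above


# Made-up sizes, only to show both outcomes.
print(check_padding_idx(num_embeddings=100, padding_idx=2))    # True
print(check_padding_idx(num_embeddings=100, padding_idx=100))  # False
```

In your case it would be worth printing the vocabulary size and pad index that the loaded config ends up with. Judging by self.n_words and self.pad_index in the traceback, they may be exposed as config.n_words and config.pad_index, but that is an assumption on my side.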

I don’t define it directly, since my code is just a consumer of the Huggingface API. I set the args like this:

args = {
    'data_dir': '',
    'model_type': 'xlm',
    'model_name': 'xlm-clm-ende-1024',
    'task_name': 'binary',
    'output_dir': 'outputs/',
    'cache_dir': 'cache/',
    'do_train': True,
    'do_eval': True,
    'fp16': False,
    'fp16_opt_level': 'O1',
    'max_seq_length': 256,
    'output_mode': 'classification',
    'train_batch_size': 32,
    'eval_batch_size': 32,

    'gradient_accumulation_steps': 1,
    'num_train_epochs': 3,
    'weight_decay': 0,
    'learning_rate': 4e-5,
    'adam_epsilon': 1e-8,
    'warmup_steps': 0,
    'max_grad_norm': 1.0,

    'logging_steps': 50,
    'evaluate_during_training': False,
    'save_steps': 2000,
    'eval_all_checkpoints': True,

    'overwrite_output_dir': True,
    'reprocess_input_data': True,
    'notes': 'Using train.csv'
}

Then I load the pre-trained models:

MODEL_CLASSES = {
    'xlm': (XLMConfig, XLMForSequenceClassification, XLMTokenizer),
    'roberta': (RobertaConfig, RobertaForSequenceClassification, RobertaTokenizer)
}

config_class, model_class, tokenizer_class = MODEL_CLASSES[args['model_type']]
config = config_class.from_pretrained(args['model_name'], num_labels=2, finetuning_task=args['task_name'])
tokenizer = tokenizer_class.from_pretrained(args['model_name'])
model = model_class.from_pretrained(args['model_name'])

task = args['task_name']
processor = processors[task]()
label_list = processor.get_labels()
num_labels = len(label_list)

Somewhere in your model, num_embeddings and padding_idx have to be set.
Skimming through the Huggingface repo, num_embeddings for Bart is set via num_embeddings += padding_idx + 1, which seems to be the right behavior.

I would recommend checking the GitHub issues for similar errors.
If you can’t find anything related, create an issue and ask the authors.

But I am not using Bart. I am just trying to understand why a working model would stop working. I used a pre-trained XLM, which I fine-tuned and reused many times for transfer learning. I am beginning to think that my model became incompatible after Huggingface updated the pip module, which I install automatically at runtime, and that my model is now hitting an abnormal condition. Is there any way to “turn the clock back”? I think the code involved is in the Huggingface tokenizer.

Which model are you using and which version was working?

Depending on how you’ve installed the wheels, you should be able to pin a version with pip install library==version.
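
Before pinning anything, it may help to see which versions are actually present in the runtime. The snippet below is a small sketch using importlib.metadata from the standard library (Python 3.8+; on older interpreters, pip show <package> gives the same information). The helper name installed_version is my own.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_version(package):
    """Return the installed version string of a pip package, or None."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None  # package is not installed in this environment


# Package names as they would be passed to pip install.
for name in ("transformers", "pytorch-transformers", "torch"):
    print(name, installed_version(name))
```

If both transformers and pytorch-transformers show up, check which one your import statements actually pull in; the module paths in a traceback tell you which package is running.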

This is how I install Huggingface:

!pip install transformers==2.4.1
!pip install pytorch-transformers==1.2.0
!pip install tensorboardX

After that I load the pretrained XLM using the args I listed earlier.

Collecting transformers==2.4.1
  Downloading (475kB)
     |████████████████████████████████| 481kB 6.3MB/s 
Requirement already satisfied: filelock in /usr/local/lib/python3.6/dist-packages (from transformers==2.4.1) (3.0.12)
Collecting sentencepiece
  Downloading (1.0MB)
     |████████████████████████████████| 1.0MB 19.5MB/s 
Collecting sacremoses
  Downloading (883kB)
     |████████████████████████████████| 890kB 40.6MB/s 
Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from transformers==2.4.1) (2.21.0)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.6/dist-packages (from transformers==2.4.1) (2019.12.20)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.6/dist-packages (from transformers==2.4.1) (4.38.0)
Requirement already satisfied: boto3 in /usr/local/lib/python3.6/dist-packages (from transformers==2.4.1) (1.12.43)
Collecting tokenizers==0.0.11
  Downloading (3.1MB)
     |████████████████████████████████| 3.1MB 40.1MB/s 

These two lines are resulting in the error:

config = config_class.from_pretrained(args['model_name'], num_labels=2, finetuning_task=args['task_name'])
tokenizer = tokenizer_class.from_pretrained(args['model_name'])

I think intuitively that this is what changed somehow:

INFO:pytorch_transformers.modeling_utils:loading weights file from cache at /root/.cache/torch/pytorch_transformers/5a816efe624c800afaf4f903fae30944e1d8d44c13dec7e924434a02fdd585e1.b741240cfb67fdf78fd931f6e246356fa1f054ef272dbeae43d5d659d520b07d

I had exactly the same issue.
My problem was solved by replacing:
from pytorch_transformers import XLMTokenizer, XLMModel
with:
from transformers import XLMTokenizer, XLMModel

Thanks for the solution. But what do I do with

from pytorch_transformers import AdamW, WarmupLinearSchedule

What are the equivalent classes in transformers?

My apologies for the delayed response. I didn’t receive the email.

You should use the get_linear_schedule_with_warmup function instead of WarmupLinearSchedule.

The code will be:

from transformers import AdamW, get_linear_schedule_with_warmup


scheduler = WarmupLinearSchedule(optimizer, warmup_steps=WARMUP_STEPS, t_total = -1)

should be replaced with:

scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps = -1)
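
One thing to double-check is the -1. As far as I understand the linear schedule, the learning-rate multiplier ramps up during warmup and then decays to zero at num_training_steps, so passing -1 would leave the learning rate at zero once warmup is over. The sketch below is a plain-Python approximation of that rule, written from my understanding rather than copied from the library; linear_multiplier is a hypothetical name.

```python
def linear_multiplier(step, num_warmup_steps, num_training_steps):
    """Approximate LR multiplier of a linear warmup / linear decay schedule."""
    if step < num_warmup_steps:
        # ramp up from 0 to 1 over the warmup steps
        return step / max(1, num_warmup_steps)
    # decay from 1 to 0 between the end of warmup and the last step
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


print(linear_multiplier(5, 10, 100))   # 0.5 (halfway through warmup)
print(linear_multiplier(55, 10, 100))  # 0.5 (halfway through decay)
print(linear_multiplier(20, 10, -1))   # 0.0 (total steps given as -1)
```

If that holds for your version, num_training_steps would normally be the total number of optimizer steps, for example the number of batches per epoch times num_train_epochs, divided by gradient_accumulation_steps.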


Is transformers the newer version? (The name suggests so, since pytorch_transformers was used up to version 1.something.)

If so, did you install both pytorch_transformers and transformers? I’m using the former because of Python 2.7, and my codebase worked until recently (I checked only a few days ago and noticed that XLM raises the error mentioned in the original post). If I also install transformers, I get a different error; if I do not install transformers, then “from transformers import …” does not work. How were you able to resolve this?

Edit: look at the resolution here if you need to use Python 2 and pytorch_transformers <= 1.2.0