I built a new tokenizer from "bert-base-uncased", keeping the same tokenization algorithm but training it on a specific dataset (in a non-English language), with a vocab_size of 150000, which largely exceeds the original BERT vocab_size (30522). When I fine-tune the model with this new tokenizer in PyTorch, it crashes with an "index out of range" error: as soon as it encounters a token with an id > 30522, the model can't handle it.
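To illustrate the mechanism (a minimal sketch in plain PyTorch, not my actual training code): BERT's word-embedding matrix has 30522 rows, so any id produced by the new tokenizer that is >= 30522 is an invalid row index for the lookup.

```python
import torch
import torch.nn as nn

# Embedding table with the original BERT vocab size (30522 rows).
emb = nn.Embedding(num_embeddings=30522, embedding_dim=768)

ok_ids = torch.tensor([[101, 2023, 102]])    # all ids < 30522: lookup works
print(emb(ok_ids).shape)                     # torch.Size([1, 3, 768])

bad_ids = torch.tensor([[101, 140000, 102]]) # an id from the new 150k vocab
try:
    emb(bad_ids)                             # row 140000 does not exist
except IndexError as e:
    print("IndexError:", e)
```

This is exactly the failure mode I see: PyTorch raises an IndexError the moment an out-of-range id reaches the embedding layer.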
Using TensorFlow, the same setup (vocab_size = 150000) works without returning any errors. I could fine-tune the model and get satisfactory results on the test set (not excellent, but acceptable up to a certain degree). With PyTorch, however, it fails even when I rebuild the tokenizer with vocab_size = 30522, the same as the original BERT: the accuracy goes from 37 to 100, then to 50, then to 25 over the epochs, and the predictions keep getting worse. Does anyone have an explanation for this?