Vocabulary size mismatch between two datasets for NMT task

Hello @all :slight_smile:

I have a common issue related to continuing to train a pre-trained model for a machine translation task.

When I load the pre-trained model together with the new dataset, I get an error about a mismatch between tensor sizes.

size mismatch for module.decoder.fc.weight: copying a param with shape torch.Size([609, 256]) from checkpoint, the shape in current model is torch.Size([608, 256]).
size mismatch for module.decoder.fc.bias: copying a param with shape torch.Size([609]) from checkpoint, the shape in current model is torch.Size([608]).

I think this issue is related to the vocabulary size (self.token_embedding = nn.Embedding(vocab_size, d_model)), because whenever I build the vocabulary with build_vocab from only one dataset and use it for both pre-training and fine-tuning, everything works fine.
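
For reference, the same kind of size mismatch can be reproduced in isolation. This is only a toy sketch, with the layer sizes taken from the error above rather than from my real model:

import torch.nn as nn

d_model = 256
old_fc = nn.Linear(d_model, 609)  # decoder fc sized for the pre-training vocabulary
new_fc = nn.Linear(d_model, 608)  # decoder fc sized for the new dataset's vocabulary

# raises: size mismatch for weight: copying a param with shape torch.Size([609, 256])
# from checkpoint, the shape in current model is torch.Size([608, 256])
new_fc.load_state_dict(old_fc.state_dict())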

Any suggestions to overcome this issue?

Kind regards,
Aiman Solyman

RuntimeError: Error(s) in loading state_dict for MultipleGPUs:
	size mismatch for module.encoder.token_embedding.weight: copying a param with shape torch.Size([875, 256]) from checkpoint, the shape in current model is torch.Size([888, 256]).
	size mismatch for module.decoder.token_embedding.weight: copying a param with shape torch.Size([628, 256]) from checkpoint, the shape in current model is torch.Size([608, 256]).
	size mismatch for module.decoder.fc.weight: copying a param with shape torch.Size([628, 256]) from checkpoint, the shape in current model is torch.Size([608, 256]).
	size mismatch for module.decoder.fc.bias: copying a param with shape torch.Size([628]) from checkpoint, the shape in current model is torch.Size([608]).

As you’ve already explained, the error is most likely raised due to different vocabularies in both datasets.
If you are planning to use both datasets, you should create a common vocabulary. Besides the different number of words or tokens, you would also have to make sure the common words use the same indices.
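
As a minimal sketch of what I mean (build_common_vocab and the special tokens here are just placeholders for whatever your preprocessing uses):

def build_common_vocab(tokens_a, tokens_b, specials=("<pad>", "<unk>", "<sos>", "<eos>")):
    # union of the tokens from both datasets; the set removes duplicates
    # and sorting keeps the token -> index assignment reproducible
    merged = sorted(set(tokens_a) | set(tokens_b))
    itos = list(specials) + merged                     # index -> token
    stoi = {tok: idx for idx, tok in enumerate(itos)}  # token -> index
    return stoi, itos

stoi, itos = build_common_vocab(["the", "cat", "sat"], ["the", "dog", "ran"])
vocab_size = len(itos)  # single size for nn.Embedding and the decoder fc in both stages

As long as this single mapping is reused in both stages, the embedding and fc shapes will match when loading the checkpoint.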


Thank you, sir, for your comment :slight_smile:

How do I “create a common vocabulary”? Is there a tool for this, or do I have to write a class for it myself? Right now, each time I need to load a dataset, I build the vocabulary from the dataset that has the largest number of unique words (which is the smaller of the two in size).

I would probably stick to your current way of creating the vocabulary for a single dataset, but use both datasets instead.
E.g., assuming you are using the unique target indices to create the mapping, you could extend the target list with both datasets and create a set to remove duplicates.

Note again that you should also verify that both datasets use the same indices for the same words.
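
A quick way to check this, assuming stoi_a and stoi_b are the token-to-index dicts of the two datasets (again just a sketch):

def vocabs_agree(stoi_a, stoi_b):
    # True only if every word present in both vocabularies is mapped to the same index
    shared = set(stoi_a) & set(stoi_b)
    return all(stoi_a[word] == stoi_b[word] for word in shared)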


Here is a solution using a shared torchtext Field:

from torchtext.data import Field
from torchtext.datasets import IMDB, SNLI

TEXT = Field(...)
imdb_train, imdb_test = IMDB.splits(text_field=TEXT, ...)
snli_train, snli_valid, snli_test = SNLI.splits(text_field=TEXT, ...)
# one Field shared by both datasets -> one vocabulary built over both training sets
TEXT.build_vocab(imdb_train, snli_train)
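
Since both datasets share the same Field, build_vocab over the two training splits produces a single vocabulary; len(TEXT.vocab) is then the one vocab_size to pass to nn.Embedding (and the decoder output layer) during both pre-training and fine-tuning, so the checkpoint shapes match when loading.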