I have a common issue related to continuing to train a pretrained model on a machine translation task.
When I load the pretrained model and the new dataset, I get an error about mismatched tensor sizes.
size mismatch for module.decoder.fc.weight: copying a param with shape torch.Size([609, 256]) from checkpoint, the shape in current model is torch.Size([608, 256]).
size mismatch for module.decoder.fc.bias: copying a param with shape torch.Size([609]) from checkpoint, the shape in current model is torch.Size([608]).
I think this issue is related to the vocabulary size (self.token_embedding = nn.Embedding(vocab_size, d_model)), because whenever I build the vocabulary with build_vocab using only one dataset during both pre-training and fine-tuning, everything works fine.
RuntimeError: Error(s) in loading state_dict for MultipleGPUs:
size mismatch for module.encoder.token_embedding.weight: copying a param with shape torch.Size([875, 256]) from checkpoint, the shape in current model is torch.Size([888, 256]).
size mismatch for module.decoder.token_embedding.weight: copying a param with shape torch.Size([628, 256]) from checkpoint, the shape in current model is torch.Size([608, 256]).
size mismatch for module.decoder.fc.weight: copying a param with shape torch.Size([628, 256]) from checkpoint, the shape in current model is torch.Size([608, 256]).
size mismatch for module.decoder.fc.bias: copying a param with shape torch.Size([628]) from checkpoint, the shape in current model is torch.Size([608]).
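For reference, here is a minimal sketch of how I think the mismatch arises (the vocab sizes and the Decoder module are hypothetical stand-ins for my actual model, matching the shapes in the error above):

```python
import torch.nn as nn

d_model = 256


class Decoder(nn.Module):
    """Toy decoder whose parameter shapes depend on the vocabulary size."""

    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.fc = nn.Linear(d_model, vocab_size)


# Model built with the pre-training vocabulary (628 target tokens)
pretrain = Decoder(vocab_size=628, d_model=d_model)
checkpoint = pretrain.state_dict()

# Fine-tuning vocabulary has only 608 tokens, so the shapes no longer match
finetune = Decoder(vocab_size=608, d_model=d_model)
try:
    finetune.load_state_dict(checkpoint)
except RuntimeError as e:
    print(e)  # size mismatch for token_embedding.weight, fc.weight, fc.bias
```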
As you’ve already explained, the error is most likely raised due to different vocabularies in both datasets.
If you are planning to use both datasets, you should create a common vocabulary. Besides the different number of words or tokens, you would also have to make sure the common words use the same indices.
How do I create a ‘common vocabulary’? Is there a tool for this, or do I have to write a class myself? At the moment, each time I need to load a dataset, I build the vocabulary from the dataset that has the largest number of unique words (even though it is the smaller dataset).
I would probably stick to your current way of creating the vocabulary for a single dataset, but use both datasets instead.
E.g., assuming you are using the unique target tokens to create the mapping, you could extend the target list with the tokens from both datasets and create a set to remove duplicates.
Note again that you should also verify that both datasets map the same words to the same indices.
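A minimal sketch of that idea, assuming each dataset yields a plain list of tokens (the helper name, the special tokens, and the sample data are hypothetical; adapt them to your actual pipeline):

```python
def build_common_vocab(*token_lists):
    # Union of tokens from all datasets; sorting makes the
    # token -> index mapping deterministic across runs, so the
    # same word always gets the same index.
    tokens = sorted(set().union(*token_lists))
    # Reserve the first indices for special tokens if your pipeline uses them.
    specials = ["<pad>", "<unk>"]
    itos = specials + tokens
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


# Usage: collect tokens from both the pre-training and fine-tuning data
pretrain_tokens = ["hello", "world", "foo"]   # hypothetical sample
finetune_tokens = ["hello", "bar"]            # hypothetical sample
stoi, itos = build_common_vocab(pretrain_tokens, finetune_tokens)
print(stoi["hello"])  # same index no matter which dataset is loaded
```

Since both models are then built with len(itos) as the vocabulary size, the embedding and output-projection shapes stay identical between pre-training and fine-tuning, and the checkpoint loads without a size mismatch.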