When I train a text generation model and then reload it in a separate session, I get abysmal performance. I suspect the problem is the vocabulary changing: every time I start a session I build a new vocabulary from the training data, which comes from a random split of a DataFrame, so the `stoi` mapping is different every time.
Is there a way to save the vocabulary along with the model parameters to make sure that inference will be successful?
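This is roughly what I have in mind — a minimal sketch where I stand in for the real vocab object with a plain `stoi` dict and persist it with `json` (in my real code the vocab comes from the training tokens and the weights from `model.state_dict()`; the names here are hypothetical):

```python
import json

def build_vocab(tokens):
    # Sorting makes stoi deterministic for the same token set, but the
    # token set itself still depends on the random train split.
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}

stoi = build_vocab(["the", "cat", "sat", "the"])

# Save the vocabulary next to the model checkpoint at training time...
with open("vocab.json", "w") as f:
    json.dump(stoi, f)

# ...then in a later inference session, load it back instead of rebuilding it.
with open("vocab.json") as f:
    reloaded = json.load(f)

assert reloaded == stoi  # same token -> index mapping as at training time
```

Is saving it as a separate file like this the right approach, or can it be bundled into the same checkpoint file as the model parameters?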