[input] -> [embeddings] -> [BiLSTM] (the two hidden layers concatenated with torch.cat) ->
[FC layers incl. dropout] -> [output]
- grid search for the best learning rate, batch size, number/width of FC layers, etc.
- chose the best model based on F1 on the validation set
- re-train the final model from scratch using the “best” parameters from the grid search, and compute metrics
Seeds have been set before EVERY training run, with torch.manual_seed(some value) and the other relevant RNG seeds.
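For reference, a seeding helper along these lines (a sketch; the exact set of RNGs a run touches may differ, and the cuDNN flags only matter on GPU):

```python
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Seed all RNGs that commonly affect a PyTorch training run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # seeds CPU (and CUDA) generators
    torch.cuda.manual_seed_all(seed)  # explicit, in case of multiple GPUs
    # make cuDNN pick deterministic kernels (slower, but reproducible)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


set_seed(123)
```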
The following work fine (the best results from the grid search are perfectly replicated):
- training the entire network starting with uninitialized embeddings (the layer instantiated as
nn.Embedding(num_embeddings, embedding_dim, padding_idx=0))
- importing embeddings and setting up the embedding layer using
nn.Embedding.from_pretrained(my_embeddings, freeze=True, padding_idx=0)
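For reference, a minimal sketch of the two set-ups above (my_embeddings is a random stand-in for the real pre-trained matrix):

```python
import torch
import torch.nn as nn

num_embeddings, embedding_dim = 1000, 50

# case 1: uninitialized embeddings, trained from scratch
embed_scratch = nn.Embedding(num_embeddings, embedding_dim, padding_idx=0)

# case 2: pre-trained embeddings, frozen
my_embeddings = torch.randn(num_embeddings, embedding_dim)
embed_frozen = nn.Embedding.from_pretrained(my_embeddings, freeze=True, padding_idx=0)

# freeze=True simply sets requires_grad=False on the weight
assert embed_frozen.weight.requires_grad is False
```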
When the freeze flag is changed to False, the model is not reproducible! The code is EXACTLY the same as before (as in the second case above); the only thing changed is the freeze flag.
Should I be setting some other seed? Or is this a known issue?
FWIW, here is how I worked around the issue, based on information posted in this group ([link1], [link2]). I implemented three variants of using pre-trained embeddings, and all three are reproducible.
- read pre-trained embeddings and FREEZE them
#inside the model: 
embed = nn.Embedding(num_embeddings, embedding_dim, padding_idx=0)
#embed_init is a numpy array with the pre-trained embeddings
embedpt = torch.from_numpy(embed_init).float().to(device)
#ind_init is a numpy array with the indices of words for which embeddings are available
indpt = torch.from_numpy(ind_init).long().to(device)
#after the model object has been instantiated
assert model.embed.weight.shape == embedpt.shape
with torch.no_grad():
    model.embed.weight.copy_(embedpt)  #copy the pre-trained values into the layer
model.embed.weight.requires_grad = False
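Put together, variant 1 runs end-to-end like this (a self-contained sketch with a toy embedding matrix; device handling omitted, and the copy into the layer's weight made explicit):

```python
import numpy as np
import torch
import torch.nn as nn

num_embeddings, embedding_dim = 8, 3
# toy stand-in for the real pre-trained matrix read from disk
embed_init = np.random.rand(num_embeddings, embedding_dim)

embed = nn.Embedding(num_embeddings, embedding_dim, padding_idx=0)
embedpt = torch.from_numpy(embed_init).float()

assert embed.weight.shape == embedpt.shape
with torch.no_grad():
    embed.weight.copy_(embedpt)   # load the pre-trained values
embed.weight.requires_grad = False  # FREEZE
```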
- use pre-trained embeddings only to initialize the layer (likely better than initializing with random values)
(same as 1) except
model.embed.weight.requires_grad = True
- freeze embeddings where available, train the ones that are not available: e.g. the vocab has 1000 words, you have embeddings for 700 of them (which you would like to freeze) and want to train the other 300.
(same as 2), and additionally in the training loop, after backward() and before the optimizer step:
model.embed.weight.grad[indpt] = 0
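Here is a minimal sketch of where that gradient-masking line sits in the loop (toy model, batch, and optimizer; names are placeholders):

```python
import torch
import torch.nn as nn

# toy setup: 10-word vocab, pre-trained vectors available for words 0-6
vocab_size, dim = 10, 4
embed = nn.Embedding(vocab_size, dim, padding_idx=0)
indpt = torch.tensor([0, 1, 2, 3, 4, 5, 6])
optimizer = torch.optim.SGD(embed.parameters(), lr=0.1)

before = embed.weight.detach().clone()

tokens = torch.tensor([1, 2, 8, 9])  # a toy "batch"
loss = embed(tokens).sum()           # stand-in for the real loss
optimizer.zero_grad()
loss.backward()
embed.weight.grad[indpt] = 0         # mask gradients of the frozen rows
optimizer.step()
# rows in indpt are untouched; rows 8 and 9 have been updated
```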
My code is set up to do an extended grid search on multiple GPUs, including a patience parameter (the number of epochs with no improvement in the monitored quantity, after which training is stopped). The best parameter set, including the number of epochs, is saved. When the final model is run, patience is turned off. The final model now re-traces the steps of the selected grid-search run, which was not happening with the function calls in the first post.
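The patience logic can be sketched as follows (names are hypothetical; the real code also records the stopping epoch so the final run can reproduce it):

```python
def best_epoch_with_patience(epoch_scores, patience):
    """Return the index of the best epoch, stopping once `patience`
    consecutive epochs pass without improvement (higher score = better)."""
    best_score, best_epoch, bad_epochs = float("-inf"), -1, 0
    for epoch, score in enumerate(epoch_scores):
        if score > best_score:
            best_score, best_epoch, bad_epochs = score, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch


best_epoch_with_patience([0.1, 0.3, 0.5, 0.4, 0.45], patience=2)  # → 2
```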