FWIW, here’s how I worked around the issue above, based on information posted in this group:
[link1] [link2]. I implemented three variants of using pre-trained embeddings, and all three models are reproducible.
1. Read pre-trained embeddings and freeze them:
embed = nn.Embedding(num_embeddings, embedding_dim, padding_idx=0)
# embed_init is a numpy array holding the pre-trained embeddings
embedpt = torch.from_numpy(embed_init).float().to(device)
# ind_init is a numpy array with the indices of words for which embeddings are available
indpt = torch.from_numpy(ind_init).long().to(device)
# after the model object has been instantiated
assert model.embed.weight.shape == embedpt.shape
model.embed.weight.data.copy_(embedpt)
model.embed.weight.requires_grad = False
2. Use pre-trained embeddings only to initialize (likely better than initializing with random values):
Same as 1, except model.embed.weight.requires_grad = True
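Variant 2 can be sketched end-to-end like this (a minimal standalone example; `embed_init` is a stand-in random array here, and the sizes are illustrative):

```python
import numpy as np
import torch
import torch.nn as nn

num_embeddings, embedding_dim = 5, 3
# stand-in for the real pre-trained matrix
embed_init = np.random.rand(num_embeddings, embedding_dim).astype(np.float32)

embed = nn.Embedding(num_embeddings, embedding_dim, padding_idx=0)
with torch.no_grad():
    embed.weight.copy_(torch.from_numpy(embed_init))
embed.weight.requires_grad = True  # the only change from variant 1

# a single gradient step now updates the whole embedding table
opt = torch.optim.SGD(embed.parameters(), lr=0.1)
loss = embed(torch.tensor([1, 2])).sum()
loss.backward()
opt.step()
```

After the step, the rows that were looked up have moved away from their pre-trained values, which is exactly the fine-tuning behaviour this variant is after.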
3. Freeze the embeddings that are available and train those that are not. E.g., the vocabulary has 1000 words; you have embeddings for 700 of them, which you want to freeze, while training the other 300.
Same as 2; additionally, in the training loop:
optimizer.zero_grad()
loss.backward()
model.embed.weight.grad[indpt] = 0  # zero the gradients of the pre-trained rows
optimizer.step()
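The whole of variant 3 can be verified with a small self-contained sketch (sizes, `embed_init`, and `ind_init` are illustrative stand-ins): after one step, the rows listed in `ind_init` are untouched while the other rows train.

```python
import numpy as np
import torch
import torch.nn as nn

num_embeddings, embedding_dim = 6, 4
embed_init = np.random.rand(num_embeddings, embedding_dim).astype(np.float32)
ind_init = np.array([1, 2, 3])  # rows with pre-trained vectors -> to be frozen

embed = nn.Embedding(num_embeddings, embedding_dim, padding_idx=0)
indpt = torch.from_numpy(ind_init).long()
with torch.no_grad():
    embed.weight.copy_(torch.from_numpy(embed_init))

opt = torch.optim.SGD(embed.parameters(), lr=0.1)
before = embed.weight.detach().clone()

opt.zero_grad()
# the batch touches a frozen row (1) and a trainable row (4)
loss = embed(torch.tensor([1, 4])).sum()
loss.backward()
embed.weight.grad[indpt] = 0  # kill the update for the pre-trained rows
opt.step()
```

One caveat worth noting: zeroing the gradient keeps the frozen rows fixed under plain SGD (as here), but optimizers that carry per-parameter state, such as Adam or SGD with momentum, can still move those rows from stale state, so check the optimizer you use.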
My code is set up to do an extended grid search on multiple GPUs, including a patience
parameter (the number of epochs with no improvement in the monitored quantity, after which training stops). The best parameter set, including the number of epochs, is saved. When the final model is run, patience
is turned off. The final model now re-traces the steps of the selected grid-search run, which was not happening with the function calls in the first post.
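The patience rule described above can be sketched as follows (a hypothetical helper, not the actual grid-search code; it assumes a lower monitored value is better):

```python
def epochs_to_run(val_losses, patience):
    """Return the number of epochs actually trained before patience stops training.

    val_losses: per-epoch values of the monitored quantity (lower is better).
    patience:   number of consecutive epochs without improvement tolerated.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # early stop triggered here
    return len(val_losses)  # patience never triggered
```

Saving the epoch count returned during the grid search, and then training the final model for exactly that many epochs with patience disabled, is what makes the final run re-trace the selected grid-search run.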