Model not reproducible with pretrained embeddings and freeze=False

NN arch

[input] -> [embeddings] -> [BiLSTM] (two hidden layers torch.cat) -> 
    [FC layers incl dropout] -> [output]
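
For context, here is a minimal sketch of the architecture (layer names and sizes are illustrative placeholders, not my actual code; I am reading "two hidden layers torch.cat" as concatenating the final forward and backward hidden states):

import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, num_embeddings, embedding_dim, hidden_dim, fc_dim, num_classes, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(num_embeddings, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden_dim, fc_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(fc_dim, num_classes),
        )

    def forward(self, x):
        # x: (batch, seq_len) of token indices
        emb = self.embed(x)                      # (batch, seq_len, embedding_dim)
        _, (h_n, _) = self.lstm(emb)             # h_n: (2, batch, hidden_dim)
        h = torch.cat((h_n[0], h_n[1]), dim=1)   # concatenate forward/backward final states
        return self.fc(h)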

Code Structure

  • grid search for the best learning rate, batch size, number and width of FC layers, etc.
  • choose the best model based on F1 on the validation set
  • re-train the final model from scratch using the “best” parameters from the grid search, and compute metrics

Seeds are set before EVERY training run with torch.manual_seed(some value) and torch.cuda.manual_seed_all(some value).
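
For reference, a minimal sketch of a seeding routine along these lines; the Python/NumPy seeding and the cuDNN flags go beyond the two calls above and are included only for completeness:

import random
import numpy as np
import torch

def set_seed(seed):
    # Seed the Python, NumPy and PyTorch RNGs (CPU and all GPUs).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Optional: force deterministic cuDNN kernels (can slow training down).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False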

The following works fine (the best runs from the grid search are replicated perfectly; both setups are sketched after the list below):

  1. training the entire network starting from randomly initialized embeddings (layer instantiated as nn.Embedding(#embed, edim, padding_idx=0))
  2. importing pre-trained embeddings and setting up the embedding layer with
    nn.Embedding.from_pretrained(my_embeddings, freeze=True, padding_idx=0)
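
For completeness, a self-contained toy version of the two setups (the sizes are arbitrary and a random tensor stands in for the real pre-trained vectors):

import torch
import torch.nn as nn

num_embeddings, embedding_dim = 1000, 300   # arbitrary sizes

# 1. embeddings trained from scratch together with the rest of the network
embed_scratch = nn.Embedding(num_embeddings, embedding_dim, padding_idx=0)

# 2. pre-trained embeddings, kept frozen
my_embeddings = torch.randn(num_embeddings, embedding_dim)   # stand-in for the real vectors
embed_frozen = nn.Embedding.from_pretrained(my_embeddings, freeze=True, padding_idx=0)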

When the freeze flag is changed to False, the model is not reproducible! The code is EXACTLY the same as before (as in 2); the only thing that changes is freeze=False.

Should I be setting some other seed? Or is this a known issue?

FWIW, here is how I worked around the issue, based on information posted in this group: [link1] [link2]. I implemented three variants of using pre-trained embeddings, and the models are reproducible in all three.

  1. read pre-trained embeddings and FREEZE them

embed = nn.Embedding(num_embeddings, embedding_dim, padding_idx=0)

# embed_init is a numpy array with the pre-trained embeddings
embedpt = torch.from_numpy(embed_init).float().to(device)
# ind_init is a numpy array with the indices of words for which embeddings are available
indpt = torch.from_numpy(ind_init).long().to(device)

# after the model object has been instantiated
assert model.embed.weight.shape == embedpt.shape
model.embed.weight.data.copy_(embedpt)
model.embed.weight.requires_grad = False
  2. use pre-trained embeddings only to initialize the embedding layer (may be better than initializing with random values)

(same as 1) except model.embed.weight.requires_grad = True

  3. freeze the embeddings that are available and train the ones that are not, e.g. the vocabulary has 1000 words, you have pre-trained embeddings for 700 of them (which you want to freeze) and want to train the other 300.

(same as 2), with the following addition in the training loop:

optimizer.zero_grad()
loss.backward()
model.embed.weight.grad[indpt] = 0   # zero the gradient rows of words with pre-trained embeddings
optimizer.step()

My code is set up to do an extended grid search on multiple GPUs, with a patience parameter (the number of epochs with no improvement in the monitored metric after which training is stopped). The best parameter set, including the number of epochs, is saved. When the final model is trained, patience is turned off. The final model now re-traces the steps of the selected grid-search run, which was not happening with the function calls in the first post.
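
For completeness, a minimal sketch of the patience logic described above (train_one_epoch, evaluate_f1, and the argument names are placeholders, not my actual code):

def train_with_patience(model, optimizer, train_loader, val_loader, max_epochs, patience=None):
    # train_one_epoch / evaluate_f1 are assumed helper functions (placeholders).
    best_f1, best_epoch, epochs_since_improvement = float("-inf"), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_loader, optimizer)
        val_f1 = evaluate_f1(model, val_loader)
        if val_f1 > best_f1:
            best_f1, best_epoch, epochs_since_improvement = val_f1, epoch, 0
        elif patience is not None:
            epochs_since_improvement += 1
            if epochs_since_improvement >= patience:
                break   # stop this grid-search run early
    return best_f1, best_epoch

# Grid search: run with patience on and save the best parameter set plus the epoch count.
# Final run: re-train from scratch with patience=None for the saved number of epochs,
# so it re-traces the selected grid-search run.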