Cannot reproduce BERT training results despite following all reproducibility guidelines

This is my code to set the seed values right after the imports:

import random

import numpy as np
import torch

def seed_everything(seed):
    random.seed(seed)                 # Python random module
    np.random.seed(seed)              # NumPy RNG
    torch.manual_seed(seed)           # PyTorch CPU RNG
    torch.cuda.manual_seed(seed)      # current CUDA device
    torch.cuda.manual_seed_all(seed)  # all CUDA devices, if using multi-GPU
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    torch.use_deterministic_algorithms(True)

seed_everything(42)
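
Note that the snippet above only seeds the CPU and CUDA generators explicitly. PyTorch 2.0 also exposes an MPS-specific seeding call; whether plain torch.manual_seed already covers the MPS generator depends on the version, so seeding it explicitly is a cheap safeguard (a minimal sketch, assuming the PyTorch 2.0 torch.mps API):

import torch

# Assumption: torch.mps.manual_seed is available (PyTorch >= 2.0).
if torch.backends.mps.is_available():
    torch.mps.manual_seed(42)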

My environment:
PyTorch version: 2.0
OS: macOS (Apple M1)
Device: MPS

This is how I define my model:

import torch.nn as nn
from transformers import BertModel

num_labels = 3
hidden_size = 768
intermediate_size = 800

class BertEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = BertModel.from_pretrained('bert-base-multilingual-uncased')

    def forward(self, x, mask=None):
        outputs = self.encoder(x, attention_mask=mask)
        feat = outputs[0][:, 0, :]  # [CLS] token embedding from the last hidden state
        return feat
    
    
class BertClassifier(nn.Module):
    def __init__(self, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)
#         self.softmax = nn.Softmax(dim=1)
        # self.apply(self.init_bert_weights)

    def forward(self, x):
        x = self.dropout(x)
        out = self.classifier(x)  # raw logits; CrossEntropyLoss applies softmax internally
#         out = self.softmax(out)
        return out

    def init_bert_weights(self, module):
        """Initialize weights BERT-style: normal(0, 0.02) for Linear/Embedding, zero bias."""
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=0.02)
        if isinstance(module, nn.Linear) and module.bias is not None:
            module.bias.data.zero_()
            
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")  # 'device' was not shown above; MPS per my environment

src_encoder = BertEncoder().to(device)
src_classifier = BertClassifier().to(device)
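
For context, the encoder and classifier are chained together during training; here is a minimal shape check of that wiring (the batch size and sequence length are made-up values):

# Hypothetical sanity check of the encoder -> classifier wiring.
dummy_ids = torch.randint(0, 100, (2, 16), device=device)        # (batch, seq_len) token ids
dummy_mask = torch.ones(2, 16, dtype=torch.long, device=device)  # attend to all positions
logits = src_classifier(src_encoder(dummy_ids, dummy_mask))
print(logits.shape)  # torch.Size([2, 3]) -- one logit per class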

I’m training on the GPU (the MPS device), and I have also set num_workers = 0 in the training and validation DataLoaders. Despite all of this, I’m still not able to reproduce my training losses and F1 scores.
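
One more knob from the PyTorch reproducibility notes that is not shown above is seeding the DataLoader itself, so the shuffle order is fixed across runs (a minimal sketch; train_dataset and the batch size are placeholders):

import random

import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # derive per-worker seeds from the global PyTorch seed (only used when num_workers > 0)
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(42)

train_loader = DataLoader(
    train_dataset,               # placeholder: your Dataset object
    batch_size=32,               # placeholder value
    shuffle=True,
    num_workers=0,
    worker_init_fn=seed_worker,
    generator=g,                 # fixes the shuffling order across runs
)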

My training output on the 1st run:

Epoch: 0/3
Epoch [00/03] Step [000/127]: cls_loss=1.0705
Epoch [00/03] Step [005/127]: cls_loss=0.9886
Epoch [00/03] Step [010/127]: cls_loss=0.8697
Epoch [00/03] Step [015/127]: cls_loss=1.1442
Epoch [00/03] Step [020/127]: cls_loss=0.9821
Epoch [00/03] Step [025/127]: cls_loss=0.8301
Epoch [00/03] Step [030/127]: cls_loss=0.9174
Epoch [00/03] Step [035/127]: cls_loss=0.8881
Epoch [00/03] Step [040/127]: cls_loss=0.7934
Epoch [00/03] Step [045/127]: cls_loss=1.0184
Epoch [00/03] Step [050/127]: cls_loss=1.0952
Epoch [00/03] Step [055/127]: cls_loss=0.9670
Epoch [00/03] Step [060/127]: cls_loss=0.8665
Epoch [00/03] Step [065/127]: cls_loss=0.7878
Epoch [00/03] Step [070/127]: cls_loss=0.6154
Epoch [00/03] Step [075/127]: cls_loss=0.8608
Epoch [00/03] Step [080/127]: cls_loss=0.7064
Epoch [00/03] Step [085/127]: cls_loss=0.7867
Epoch [00/03] Step [090/127]: cls_loss=0.7772
Epoch [00/03] Step [095/127]: cls_loss=0.6452
Epoch [00/03] Step [100/127]: cls_loss=0.5981
Epoch [00/03] Step [105/127]: cls_loss=0.7518
Epoch [00/03] Step [110/127]: cls_loss=0.7248
Epoch [00/03] Step [115/127]: cls_loss=1.0563
Epoch [00/03] Step [120/127]: cls_loss=0.7010
Epoch [00/03] Step [125/127]: cls_loss=0.7213
At the end of Epoch: 0
Validation loss:  0.6659477949142456
Accuracy: 0.6971046770601337
F1 score (Macro): 0.6691835627250752
F1 score (Per class): [0.61       0.62931034 0.76824034]

And the training output from the 2nd run:

Epoch: 0/3
Epoch [00/03] Step [000/127]: cls_loss=1.0752
Epoch [00/03] Step [005/127]: cls_loss=0.9756
Epoch [00/03] Step [010/127]: cls_loss=0.9635
Epoch [00/03] Step [015/127]: cls_loss=1.1132
Epoch [00/03] Step [020/127]: cls_loss=0.9640
Epoch [00/03] Step [025/127]: cls_loss=0.9263
Epoch [00/03] Step [030/127]: cls_loss=0.9199
Epoch [00/03] Step [035/127]: cls_loss=0.9258
Epoch [00/03] Step [040/127]: cls_loss=0.9136
Epoch [00/03] Step [045/127]: cls_loss=1.1773
Epoch [00/03] Step [050/127]: cls_loss=1.2147
Epoch [00/03] Step [055/127]: cls_loss=1.0307
Epoch [00/03] Step [060/127]: cls_loss=0.9063
Epoch [00/03] Step [065/127]: cls_loss=0.7165
Epoch [00/03] Step [070/127]: cls_loss=0.7686
Epoch [00/03] Step [075/127]: cls_loss=0.9018
Epoch [00/03] Step [080/127]: cls_loss=0.7115
Epoch [00/03] Step [085/127]: cls_loss=0.8505
Epoch [00/03] Step [090/127]: cls_loss=0.7284
Epoch [00/03] Step [095/127]: cls_loss=0.6582
Epoch [00/03] Step [100/127]: cls_loss=0.6921
Epoch [00/03] Step [105/127]: cls_loss=0.8489
Epoch [00/03] Step [110/127]: cls_loss=0.7658
Epoch [00/03] Step [115/127]: cls_loss=0.9741
Epoch [00/03] Step [120/127]: cls_loss=0.9331
Epoch [00/03] Step [125/127]: cls_loss=0.7483
At the end of Epoch: 0
Validation loss:  0.7412455081939697
Accuracy: 0.6859688195991092
F1 score (Macro): 0.6511615399484937
F1 score (Per class): [0.55681818 0.63934426 0.75732218]

As you can see, the losses and F1 scores don't match at all. Why am I unable to reproduce my results? Is it because of the M1 GPU or the MPS device? That is why I installed the latest PyTorch 2.0, but I'm still unable to get the same results across training runs. What is causing this? Am I missing something? Please help. Thanks!

That might be the case, but did you try running the code on the CPU, and if so, were the results deterministic?
Could you also create an issue on GitHub so that the code owners can take a look, please?

Thanks for the response. Yes, running the code on the CPU gives completely deterministic results, so the problem seems to be the GPU/MPS device. I have created an issue on GitHub but haven't gotten a response yet.
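
For reference, here is a minimal way to check determinism in isolation, outside the full training loop (a sketch assuming the PyTorch 2.0 torch.mps API):

import torch

def run_once(device):
    torch.manual_seed(42)
    if device == "mps":
        torch.mps.manual_seed(42)  # assumption: PyTorch >= 2.0
    x = torch.randn(64, 768, device=device)
    w = torch.randn(768, 3, device=device)
    return (x @ w).sum().item()

for dev in ["cpu"] + (["mps"] if torch.backends.mps.is_available() else []):
    print(dev, run_once(dev) == run_once(dev))  # True means two seeded runs matched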

Is there any solution to this problem, or is reproducibility simply not possible on MPS devices?