Large differences in training and inference results on two different machines

Hey y’all, I have seen some posts discussing this issue but none of the proposed solutions have worked for me thus far.

I trained a model (S3D) and saved a checkpoint locally on machine 1; the final accuracy on my validation set is 0.84, great! But when I train the same model on machine 2 with the same codebase, same seeds, and the same data, I get drastically different results: an accuracy of about 0.44, and the learning curve goes from a nice slope to a flatline, which suggests the model is just guessing randomly and not learning anything on machine 2.

To debug this, I loaded the checkpoint on both machines and compared the predictions. The numbers differ slightly, presumably due to rounding or floating-point precision. I am aware that the two machines have different GPU models and that this can introduce small differences in floating-point operations.

Machine 1: array([7.2582060e-07, 9.9972206e-01, 1.5989746e-04, 2.1518445e-06,
       5.8096668e-07, 1.0367761e-06, 7.5574033e-05, 3.3640531e-06,
       1.9060394e-06, 1.7345132e-05, 1.1253899e-05, 3.9614688e-06],
      dtype=float32)

Machine 2: array([7.2582617e-07, 9.9972206e-01, 1.5989838e-04, 2.1518545e-06,
       5.8097061e-07, 1.0367791e-06, 7.5574528e-05, 3.3640756e-06,
       1.9060523e-06, 1.7345230e-05, 1.1253975e-05, 3.9614915e-06],
      dtype=float32)
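
To quantify the gap, here is a quick sanity check with the two vectors copied from the printouts above (the names `m1` and `m2` are just for this comparison). Going by the printed values, the largest absolute difference is around 1e-9, the relative differences are on the order of 1e-5, and the argmax is identical, which looks like ordinary float32 noise to me:

    import numpy as np

    # Softmax outputs copied verbatim from the printouts above
    m1 = np.array([7.2582060e-07, 9.9972206e-01, 1.5989746e-04, 2.1518445e-06,
                   5.8096668e-07, 1.0367761e-06, 7.5574033e-05, 3.3640531e-06,
                   1.9060394e-06, 1.7345132e-05, 1.1253899e-05, 3.9614688e-06],
                  dtype=np.float32)
    m2 = np.array([7.2582617e-07, 9.9972206e-01, 1.5989838e-04, 2.1518545e-06,
                   5.8097061e-07, 1.0367791e-06, 7.5574528e-05, 3.3640756e-06,
                   1.9060523e-06, 1.7345230e-05, 1.1253975e-05, 3.9614915e-06],
                  dtype=np.float32)

    print("max abs diff:", np.max(np.abs(m1 - m2)))       # on the order of 1e-9
    print("max rel diff:", np.max(np.abs(m1 - m2) / m2))  # on the order of 1e-5
    print("same argmax: ", m1.argmax() == m2.argmax())    # True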

Below is the code I am running to verify the results. Could such tiny differences really cause training to diverge this much? That seems unlikely to me, and I feel like I've ticked all the boxes to make training as reproducible as possible. Any ideas or advice would be greatly appreciated!

    import random

    import numpy as np
    import torch
    import torch.nn.functional as F
    import torch.utils.data as data

    # Seed every RNG and force deterministic cuDNN kernels
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # seed, device, model, CVBDataset, val_annot_path and test_transform
    # are defined earlier in the script (omitted here)
    val_dataset = CVBDataset(
        annotations_path=val_annot_path,
        transform=test_transform,
    )

    # Load the checkpoint trained on machine 1
    pretrain_checkpoint = torch.load(
        "./checkpoints/pretrain/s3d_20240826_1CBA9A/s3d_20240826_1CBA9A_seed43_best.pth",
        map_location="cpu",
    )
    model.load_state_dict(pretrain_checkpoint["model"])
    model.to(device)
    model.eval()

    # Evaluate the same three validation samples on both machines
    subset_dataset = data.Subset(val_dataset, [1, 2, 3])

    dataloader = data.DataLoader(
        subset_dataset,
        batch_size=3,
        shuffle=False,
        num_workers=1,
    )

    probabilities = []
    with torch.no_grad():
        for batch_idx, batch in enumerate(dataloader):
            videos, _ = batch
            videos = videos.to(device)
            outputs = model(videos)

            outputs = F.softmax(outputs, dim=1)
            outputs = outputs.cpu().numpy()

            probabilities.append(outputs)
            print(outputs)
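
For completeness, the only extra reproducibility switches I'm aware of beyond the flags above are the global deterministic-algorithms flag (with its cuBLAS workspace requirement) and explicit DataLoader worker/generator seeding, as described in the PyTorch reproducibility notes. This is just a sketch of what that would look like on top of my code (`seed` and `subset_dataset` are the same variables as above; `seed_worker` is only illustrative), and I'm not sure whether any of it matters here:

    import os
    import random

    import numpy as np
    import torch
    import torch.utils.data as data

    # Required for deterministic cuBLAS kernels on CUDA >= 10.2;
    # must be set before the first CUDA call
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True)

    def seed_worker(worker_id):
        # Re-seed NumPy and random inside every DataLoader worker process
        worker_seed = torch.initial_seed() % 2**32
        np.random.seed(worker_seed)
        random.seed(worker_seed)

    g = torch.Generator()
    g.manual_seed(seed)

    dataloader = data.DataLoader(
        subset_dataset,
        batch_size=3,
        shuffle=False,
        num_workers=1,
        worker_init_fn=seed_worker,
        generator=g,
    )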