Different test results with fixed weights and a fixed seed

I am currently trying to understand whether the situation I’ve encountered is normal behaviour or a bug. I run experiments that involve training models in a simulated distributed environment. Without going into unnecessary detail: in each round, clients train on their local training set, evaluate the resulting model on their local test set, and report the values. These values are then stored in a CSV file together with the models that were tested.

To run a validation check, I fix the seed, load the local model and the local test set, and then perform a test evaluation. What bothers me is that the values reported in the CSV file (the test values recorded during the simulation) are not fully aligned with the test values I obtain when checking the simulation’s validity afterwards.

This implies that the same model (with the same weights), tested on the same dataset and with a fixed seed, produces two different results. As the model stabilizes (over N rounds we get N different models), the difference between the value reported in the CSV file and the one obtained during replication approaches 0. As an example, I am pasting the log below:

0: Iteration, Abs. Loss Diff.: 1.0015488862991333, Abs. Acc. Diff.: 0.25
1: Iteration, Abs. Loss Diff.: 0.0009263801574705965, Abs. Acc. Diff.: 0.0
2: Iteration, Abs. Loss Diff.: 0.005786736011505145, Abs. Acc. Diff.: 0.0
3: Iteration, Abs. Loss Diff.: 0.003574820756912178, Abs. Acc. Diff.: 0.0
4: Iteration, Abs. Loss Diff.: 0.007152392864227308, Abs. Acc. Diff.: 0.0
5: Iteration, Abs. Loss Diff.: 0.0015836870670318248, Abs. Acc. Diff.: 0.0
6: Iteration, Abs. Loss Diff.: 0.00476664781570435, Abs. Acc. Diff.: 0.0
7: Iteration, Abs. Loss Diff.: 0.003446925878524798, Abs. Acc. Diff.: 0.0
8: Iteration, Abs. Loss Diff.: 0.0017982900142670122, Abs. Acc. Diff.: 0.0
9: Iteration, Abs. Loss Diff.: 0.0006368839740753529, Abs. Acc. Diff.: 0.0
10: Iteration, Abs. Loss Diff.: 0.009332650899887107, Abs. Acc. Diff.: 0.0
11: Iteration, Abs. Loss Diff.: 0.0002723556756972778, Abs. Acc. Diff.: 0.0
12: Iteration, Abs. Loss Diff.: 0.010622120499610865, Abs. Acc. Diff.: 0.0
13: Iteration, Abs. Loss Diff.: 0.004144576042890535, Abs. Acc. Diff.: 0.0
14: Iteration, Abs. Loss Diff.: 0.00525180220603938, Abs. Acc. Diff.: 0.0
15: Iteration, Abs. Loss Diff.: 0.013058926761150391, Abs. Acc. Diff.: 0.0
16: Iteration, Abs. Loss Diff.: 0.008403560966253276, Abs. Acc. Diff.: 0.0
17: Iteration, Abs. Loss Diff.: 0.012890378683805492, Abs. Acc. Diff.: 0.0
18: Iteration, Abs. Loss Diff.: 0.015538938939571367, Abs. Acc. Diff.: 0.0
19: Iteration, Abs. Loss Diff.: 0.03375539824366569, Abs. Acc. Diff.: 0.0
20: Iteration, Abs. Loss Diff.: 0.0018654009699821117, Abs. Acc. Diff.: 0.0
21: Iteration, Abs. Loss Diff.: 0.008243808336555913, Abs. Acc. Diff.: 0.0
22: Iteration, Abs. Loss Diff.: 0.00302100986242293, Abs. Acc. Diff.: 0.0
23: Iteration, Abs. Loss Diff.: 0.004521983098238702, Abs. Acc. Diff.: 0.0
24: Iteration, Abs. Loss Diff.: 0.008875386621803094, Abs. Acc. Diff.: 0.0
...
46: Iteration, Abs. Loss Diff.: 0.046149560796329814, Abs. Acc. Diff.: 0.0
47: Iteration, Abs. Loss Diff.: 0.0725968092895346, Abs. Acc. Diff.: 0.0
48: Iteration, Abs. Loss Diff.: 0.03759608950349502, Abs. Acc. Diff.: 0.0
49: Iteration, Abs. Loss Diff.: 0.05040962719998787, Abs. Acc. Diff.: 0.0

Even though the value stabilizes, I find this behaviour strange. Could it be due to inherent randomness in some PyTorch components? The full code is much too complex to show here in full, but I am including my testing function below.

# Imports used by this function; `device` is defined globally elsewhere in the script.
import numpy as np
import torch
import torch.nn as nn

def test_loop(net: nn.Module,
              testdata: torch.utils.data.DataLoader):
    """Evaluate `net` on `testdata` and return aggregated test metrics."""
    net.to(device)
    net.eval()                      # disable dropout, use running batch-norm stats
    criterion = nn.CrossEntropyLoss()
    correct = 0
    total = 0
    y_pred = []
    y_true = []
    losses = []

    with torch.no_grad():
        for dic in testdata:
            inputs = dic['image'].to(device)
            targets = dic['label'].to(device)
            outputs = net(inputs)

            ######################
            outputs = outputs.cpu()
            targets = targets.cpu()
            #######################

            total += targets.size(0)
            batch_loss = criterion(outputs, targets)
            losses.append(batch_loss.item())   # store plain floats, not tensors
            pred = outputs.argmax(dim=1, keepdim=True)
            correct += pred.eq(targets.view_as(pred)).sum().item()
            y_pred.append(pred)
            y_true.append(targets)

    test_loss = np.mean(losses)     # mean of the per-batch losses
    accuracy = correct / total

    # Flatten the per-batch tensors into flat Python lists.
    y_true = [item.item() for sublist in y_true for item in sublist]
    y_pred = [item.item() for sublist in y_pred for item in sublist]

...

    return {
        'test_loss': test_loss,
        'accuracy': accuracy,
...
        'false_positive_rate': false_positive_rate
    }
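
For context, the validity check that produced the log above does roughly the following for each round and client (a sketch: Net, the checkpoint path, local_testset and the stored CSV values below are illustrative placeholders, not my exact code):

# Sketch of the validity check for one round/client (illustrative names only).
net = Net()                                               # same architecture as in the simulation
net.load_state_dict(torch.load("round_05_client_03.pt"))  # weights stored during the run

testloader = torch.utils.data.DataLoader(local_testset, batch_size=32, shuffle=False)

replayed = test_loop(net, testloader)

stored_loss, stored_acc = 0.69, 0.50   # placeholder: values read from the CSV row

print(f"Abs. Loss Diff.: {abs(replayed['test_loss'] - stored_loss)}, "
      f"Abs. Acc. Diff.: {abs(replayed['accuracy'] - stored_acc)}")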
    

I don’t see if or where you’ve enabled deterministic algorithms, as described in the Reproducibility docs. Did you check the docs and follow them?


Yes, I first followed the documentation on reproducibility.

I am fixing seeds and enabling deterministic algorithms in the main script from which I am running the simulation. The script opens with imports and the following lines:

random.seed(42)
np.random.seed(42)
torch.cuda.manual_seed(42)
torch.cuda.manual_seed_all(42)
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Then I call CUBLAS_WORKSPACE_CONFIG=:16:8 python script.py args. The script calls other libraries and modules and runs a full simulation cycle.
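
For reference, the same variable can also be set from inside Python, e.g. at the very top of a script or notebook, as long as this happens before cuBLAS is initialized; a minimal sketch, not my exact setup:

# Sketch: setting the cuBLAS workspace config from Python instead of the shell.
# It has to be set before cuBLAS is initialized, so it goes at the very top,
# before importing torch.
import os
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"

import torch
torch.use_deterministic_algorithms(True)
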
When I analyze the results in a Jupyter notebook, I use the same commands:

random.seed(42)
np.random.seed(42)
torch.cuda.manual_seed(42)
torch.cuda.manual_seed_all(42)
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

The only thing that I am not fixing is the seed for the test DataLoader. However, given that the datasets are already partitioned into training/test data, my guess is that it should not make much of a difference (?). But maybe I am wrong on this one.
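
If the loader did matter, pinning down its randomness would look roughly like the following, using the worker-seeding recipe from the Reproducibility docs (a sketch; local_testset and the loader settings are illustrative placeholders):

import random

import numpy as np
import torch

def seed_worker(worker_id):
    # Re-seed NumPy and Python's RNG inside each DataLoader worker.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(42)

testloader = torch.utils.data.DataLoader(
    local_testset,               # placeholder for the client's stored test partition
    batch_size=32,
    shuffle=False,               # evaluation does not need a random order
    num_workers=2,
    worker_init_fn=seed_worker,  # seeds the libraries inside each worker process
    generator=g,                 # controls shuffling and worker base seeds
)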