PyTorch's non-deterministic cross-entropy loss and the problem of reproducibility

I have a Bayesian neural network implemented in PyTorch and trained via an ELBO loss. I have run into reproducibility issues even though I use the same seed and set the following:

# python
seed = args.seed
random.seed(seed)
logging.info("Python seed: %i" % seed)
# numpy
seed += 1
np.random.seed(seed)
logging.info("Numpy seed: %i" % seed)
# torch
seed += 1
torch.manual_seed(seed)
logging.info("Torch CPU seed: %i" % seed)
# torch cuda
seed += 1
torch.cuda.manual_seed_all(seed)
torch.cuda.manual_seed(seed)
logging.info("Torch CUDA seed: %i" % seed)

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

I should add that I use a cross-entropy (XE) loss, which I understand is not deterministic in PyTorch. It is the only source of randomness I am aware of. With a large learning_rate (0.1), I cannot reproduce my results and the gaps between runs are huge. When I reduce the learning_rate by a factor of 10 (to 0.01), the gap disappears. My intuition is that the non-deterministic loss is the culprit and that the large learning rate merely amplifies it. What do you think? I would appreciate any hints and intuitions.
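One way to quantify the gap described above is to train twice with the same seed and compare the resulting parameters for each learning rate. The following is only a minimal sketch under stated assumptions: `train_once` and `max_gap` are hypothetical helpers, and the small MLP on random data is a stand-in for the real Bayesian network, not the actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def train_once(seed, lr, steps=50):
    # Hypothetical stand-in for the real training loop: a small MLP on random data.
    torch.manual_seed(seed)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 10)).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    x = torch.randn(64, 20, device=device)
    y = torch.randint(0, 10, (64,), device=device)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
    # Flatten all parameters into one vector so two runs are easy to compare.
    return torch.cat([p.detach().flatten().cpu() for p in model.parameters()])


def max_gap(lr, seed=0):
    # Largest absolute parameter difference between two identically seeded runs.
    first = train_once(seed, lr)
    second = train_once(seed, lr)
    return (first - second).abs().max().item()


gaps = {lr: max_gap(lr) for lr in (0.1, 0.01)}
print(gaps)
```

If the pipeline is fully deterministic, both gaps should be zero regardless of the learning rate. A non-zero gap that grows with the learning rate would be consistent with a small non-deterministic difference being amplified by larger update steps.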

Could you describe how you’ve narrowed down nn.CrossEntropyLoss as the source of non-determinism?
Using torch.use_deterministic_algorithms(True) and this minimal code snippet does not show any issues:

torch.use_deterministic_algorithms(True)
criterion = nn.CrossEntropyLoss()
x = torch.randn(10, 10)
y = torch.randint(0, 10, (10,))
x = x.to("cuda")
y = y.to("cuda")
loss = criterion(x, y)
loss
# tensor(2.7674, device='cuda:0')
loss0 = loss.clone()
loss = criterion(x, y)
loss0 - loss
# tensor(0., device='cuda:0')
loss1 = criterion(x, y)
loss0 - loss1
# tensor(0., device='cuda:0')

I use F.cross_entropy() and I get a warning in the beginning that says [torch.nn.NLLLoss] is not deterministic.

That’s interesting as it might be caused by different PyTorch versions.
Could you post a minimal and executable code snippet which would raise the warning as well as your PyTorch version, please?

UserWarning: nll_loss2d_forward_out_cuda_template does not have a deterministic implementation, but you set ‘torch.use_deterministic_algorithms(True, warn_only=True)’. You can file an issue at Issues · pytorch/pytorch · GitHub to help us prioritize adding deterministic support for this operation. (Triggered internally at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/Context.cpp:82.)
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
~/anaconda3_new/envs/first_env/lib/python3.9/site-packages/torch/autograd/__init__.py:173: UserWarning: scatter_add_cuda_kernel does not have a deterministic implementation, but you set ‘torch.use_deterministic_algorithms(True, warn_only=True)’. You can file an issue at Issues · pytorch/pytorch · GitHub to help us prioritize adding deterministic support for this operation. (Triggered internally at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/Context.cpp:82.)

These are the exact warning messages that I get when I run my code.
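For context on why kernels such as scatter_add_cuda_kernel are flagged: as far as I understand, they accumulate with atomic adds, so the order of the additions can change between runs, and floating-point addition is not associative. The pure-Python sketch below only illustrates that last point; it does not call the CUDA kernels themselves.

```python
# Floating-point addition is not associative: grouping changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False


def accumulate(values):
    # Sum the values strictly in the given order, like a sequential reduction.
    total = 0.0
    for v in values:
        total += v
    return total


# The same three numbers, accumulated in two different orders.
in_order = accumulate([1e16, 1.0, -1e16])   # the 1.0 is absorbed by 1e16
reordered = accumulate([1e16, -1e16, 1.0])  # the large terms cancel first
print(in_order, reordered)  # 0.0 1.0
```

Differences of this kind are usually tiny per step, but they can compound over many optimizer updates.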

I tried this code but I could not reproduce the error.

import torch
torch.autograd.set_detect_anomaly(True)
torch.use_deterministic_algorithms(True, warn_only=True)
import torch.nn as nn
from torch.nn import functional as F

x = torch.randn(10, 10)
y = torch.randint(0, 10, (10,))
x = x.to("cuda")
y = y.to("cuda")
loss = F.cross_entropy(x, y)
print(loss)

My environment includes:
pytorch 1.12.1 py3.9_cuda11.6_cudnn8.3.2_0 pytorch
cudatoolkit 11.6.0

So my code snippet does not reproduce the warning but yours does? Could you still post your code which raises the warning, please?

So, the code snippet I tried does not reproduce the warning, but when I run my original code I get it.
This is the module by which I calculate the loss. It is a Bayesian Neural network used for continual learning and the loss is the ELBO loss.

from __future__ import absolute_import
from __future__ import print_function

import torch
torch.autograd.set_detect_anomaly(True)
torch.use_deterministic_algorithms(True, warn_only=True)
import torch.nn as nn
from torch.nn import functional as F
import pdb


def _accuracy(output, target, topk=(1,)):

    #pdb.set_trace()

    maxk = max(topk)
    batch_size = target.size(0)
    _, pred = output.topk(maxk, 1, True, True)
    pred = pred.t()
    correct = pred.eq(target.view(1, -1))
    res = []
    for k in topk:
        correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
        res.append(correct_k.mul_(100.0 / batch_size))
    return res


class ClassificationLossVI(nn.Module):
    def __init__(self, args, topk=3):
        super(ClassificationLossVI, self).__init__()
        self._topk = tuple(range(1, topk+1))
        self.label_trick = args.label_trick
        self.label_trick_valid = args.label_trick_valid
        self.coreset_training = args.coreset_training
        self.coreset_kld = args.coreset_kld
        self.merged_training = args.merged_training
        
    def forward(self, output_dict, target_dict):
        samples = 1
        prediction_mean = output_dict['prediction_mean'].unsqueeze(dim=2).expand(-1, -1, samples)
        prediction_variance = output_dict['prediction_variance'].unsqueeze(dim=2).expand(-1, -1, samples)
        target = target_dict['target1'] 
        target_expanded = target.unsqueeze(dim=1).expand(-1, samples)
        normal_dist = torch.distributions.normal.Normal(torch.zeros_like(prediction_mean), torch.ones_like(prediction_mean))
        
        if self.training:
          
            losses = {}
            normals =  normal_dist.sample()
            prediction = prediction_mean + torch.sqrt(prediction_variance) * normals 
            
            # -------------------------------------------------------------------------------
            #                                 Labels trick
            # -------------------------------------------------------------------------------
            if self.label_trick is False or self.coreset_kld==1:
           
                # check the dtype of prediction tensor 
                loss = F.cross_entropy(prediction, target_expanded, reduction='mean')
                kl_div = output_dict['kl_div']
                losses['total_loss'] = loss + kl_div()
            
                with torch.no_grad():
                  p = F.softmax(prediction, dim=1).mean(dim=2)
                  losses['xe'] =  F.cross_entropy(prediction, target_expanded, reduction='mean')
                  acc_k = _accuracy(p, target, topk=self._topk)
                  for acc, k in zip(acc_k, self._topk):
                      losses["top%i" % k] = acc
            else:
                
                task_targets = [item -30 for item in target_dict['task_labels']] 
                ordered_task_targets = torch.unique(torch.Tensor(task_targets).long(), sorted=True) 
                if self.merged_training is True:
                    coreset_targets = target_dict['coresets_list'] 
                    if len(coreset_targets)>0:
                        flat_coreset_targets = [item for sublist in coreset_targets for item in sublist] 
                        seen_targets=torch.cat((torch.Tensor(flat_coreset_targets),torch.Tensor(task_targets)), 0)
                        ordered_task_targets = torch.unique(seen_targets, sorted=True).long() 
                        
                # Get the current batch labels (and sort them for reassignment)
                labels = target.clone().detach() 
                for t_idx, t in enumerate(ordered_task_targets):
                    labels[labels==t] = t_idx
         
                labels_expanded = labels.unsqueeze(dim=1).expand(-1, samples)  
                loss_label_trick = F.cross_entropy(prediction[:, ordered_task_targets, :], labels_expanded, reduction='mean')
                kl_div = output_dict['kl_div']
                losses['total_loss'] = loss_label_trick + kl_div()

                with torch.no_grad():
                    p = F.softmax(prediction[:, ordered_task_targets, :], dim=1).mean(dim=2)
                    losses['xe'] =  F.cross_entropy(prediction[:, ordered_task_targets, :], labels_expanded, reduction='mean')
                    acc_k = _accuracy(p, labels, topk=self._topk)
                    for acc, k in zip(acc_k, self._topk):
                        losses["top%i" % k] = acc      
            # ---------------------------------------------------------------------------------------------------
            
        else:
              
              if self.label_trick and self.label_trick_valid: 
                   
                    with torch.no_grad():
                        normals = normal_dist.sample()  
                        prediction = prediction_mean + torch.sqrt(prediction_variance) * normals
                        
                        labels = target.clone().detach() 
                        
                        task_targets = target_dict['task_labels'][0] #shape: [10, 10]
                        ordered_task_targets = torch.unique(task_targets, sorted=True)
                        
                        for t_idx, t in enumerate(ordered_task_targets):
                            labels[labels==t] = t_idx

                        losses = {}
                        kl_div = output_dict['kl_div']
                    
                        p = F.softmax(prediction[:, ordered_task_targets, :], dim=1).mean(dim=2)
                        losses['total_loss'] = - torch.log(p[range(p.shape[0]), labels]).mean() + kl_div()
                        losses['xe'] = - torch.log(p[range(p.shape[0]), labels]).mean()

                        acc_k = _accuracy(p, labels, topk=self._topk)
                        for acc, k in zip(acc_k, self._topk):
                            losses["top%i" % k] = acc
              else: 
                    # pdb.set_trace()
                    with torch.no_grad():
                        normals = normal_dist.sample()
                        prediction = prediction_mean + torch.sqrt(prediction_variance) * normals 
                        p = F.softmax(prediction, dim=1).mean(dim=2)
                        losses = {}
                        kl_div = output_dict['kl_div']
                        losses['total_loss'] = - torch.log(p[range(p.shape[0]), target]).mean() + kl_div()
                        losses['xe'] = - torch.log(p[range(p.shape[0]), target]).mean()
                    
                        acc_k = _accuracy(p, target, topk=self._topk)
                        for acc, k in zip(acc_k, self._topk):
                            losses["top%i" % k] = acc
        return losses

    def set_coreset_kld_flag(self, _flag):
        self.coreset_kld=_flag

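Thanks for the code. The warning is raised by the 2D implementation of nll_loss (nll_loss2d_forward_out_cuda_template in your message). Your prediction tensor has an extra samples dimension after the class dimension, which is presumably why your code reaches this kernel while the 2D inputs in the small snippets do not.
You might be able to avoid this issue by using reduction="none" and applying the reduction explicitly afterwards as seen here: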

torch.use_deterministic_algorithms(True)
criterion = nn.CrossEntropyLoss()
x = torch.randn(10, 10, 24, 24)
y = torch.randint(0, 10, (10, 24, 24))
x = x.to("cuda")
y = y.to("cuda")
loss = criterion(x, y)
# RuntimeError: nll_loss2d_forward_out_cuda_template does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'

# alternative
criterion = nn.CrossEntropyLoss(reduction="none")
loss = criterion(x, y)
loss = loss.mean()
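Applied to the shapes used in the ClassificationLossVI module above, the same idea could look like the sketch below. The batch size and class count are made-up values, and the tensors are random placeholders for prediction and target_expanded. The snippet only checks that the explicit mean matches reduction='mean' numerically; it deliberately does not enable torch.use_deterministic_algorithms, since the built-in reduction would then warn or raise on CUDA.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed shapes: [batch, classes, samples] for the logits, [batch, samples] for targets.
batch, classes, samples = 8, 10, 1
prediction = torch.randn(batch, classes, samples, device=device)
target = torch.randint(0, classes, (batch,), device=device)
target_expanded = target.unsqueeze(dim=1).expand(-1, samples)

# Per-element losses, reduced explicitly afterwards.
per_element = F.cross_entropy(prediction, target_expanded, reduction="none")
loss_explicit = per_element.mean()

# Built-in reduction, for comparison only.
loss_builtin = F.cross_entropy(prediction, target_expanded, reduction="mean")

print(per_element.shape)
print(torch.allclose(loss_explicit, loss_builtin))
```

In the module, this would mean replacing reduction='mean' with reduction='none' in the F.cross_entropy calls and calling .mean() on the result.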

Thank you very much! I will try this.

I would like to add a caveat for anyone using this solution in multiclass problems where a certain index must be excluded by passing ignore_index to the constructor (e.g. nn.CrossEntropyLoss(ignore_index=1)): be careful to also exclude these indices when calculating the mean.

CrossEntropyLoss will do that for you if you use reduction="mean" but if you use "none" and then apply the mean yourself, you become responsible for excluding those. You may do so like this (assuming your ignored class index is 1):

loss = loss[y != 1].mean()
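A small check of this, as a sketch with made-up logits and targets (the ignored index 1 matches the assumption above): with reduction="none" the ignored positions come back as zeros, so a plain .mean() still counts them in the denominator, while the masked mean matches what reduction="mean" computes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ignore = 1  # assumed ignored class index
x = torch.randn(6, 4)                   # made-up logits: 6 samples, 4 classes
y = torch.tensor([0, 1, 2, 1, 3, 0])    # two targets carry the ignored index

# Built-in mean, which skips the ignored targets.
reference = nn.CrossEntropyLoss(ignore_index=ignore)(x, y)

# Per-element losses; ignored positions are returned as 0.
per_element = nn.CrossEntropyLoss(ignore_index=ignore, reduction="none")(x, y)

masked_mean = per_element[y != ignore].mean()  # divides by 4 kept samples
naive_mean = per_element.mean()                # divides by all 6 samples

print(torch.allclose(masked_mean, reference))
print(torch.allclose(naive_mean, reference))
```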