PyTorch's non-deterministic cross-entropy loss and the problem of reproducibility

papoo13 · February 9, 2023, 8:27am

I have a Bayesian neural network that is implemented in PyTorch and trained with an ELBO loss. I run into reproducibility issues even though I use the same seed and set the following:

# python
seed = args.seed
random.seed(seed)
logging.info("Python seed: %i" % seed)
# numpy
seed += 1
np.random.seed(seed)
logging.info("Numpy seed: %i" % seed)
# torch
seed += 1
torch.manual_seed(seed)
logging.info("Torch CPU seed: %i" % seed)
# torch cuda
seed += 1
torch.cuda.manual_seed_all(seed)
torch.cuda.manual_seed(seed)
logging.info("Torch CUDA seed: %i" % seed)

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

I should add that I use a cross-entropy (XE) loss, which is not deterministic in PyTorch; it is the only source of randomness I am aware of. With a large learning rate (0.1) I cannot reproduce my results and I see huge gaps between runs. When I reduce the learning rate by a factor of 10 (to 0.01), the gap disappears. My intuition is that the non-deterministic loss is the culprit and the large learning rate merely acts as a catalyst. What do you think? I would appreciate any hints or intuitions.
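For reference, the seeding steps above can be gathered into a single helper. This is only a sketch: `seed_everything` is a hypothetical name, it uses one seed for every library rather than incrementing it as the post does, and it additionally sets `CUBLAS_WORKSPACE_CONFIG` and calls `torch.use_deterministic_algorithms` with `warn_only=True`, which the original snippet does not.

```python
import os
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    """Hypothetical helper: seed Python, NumPy and PyTorch, and request deterministic kernels."""
    # cuBLAS needs this workspace setting for deterministic results on CUDA >= 10.2.
    # It has to be set before the first CUDA call to take effect.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Warn (rather than raise) whenever an op without a deterministic implementation runs.
    torch.use_deterministic_algorithms(True, warn_only=True)


seed_everything(0)
first = torch.rand(3)
seed_everything(0)
second = torch.rand(3)
print(torch.equal(first, second))
```

Re-seeding before each draw should give identical random tensors. It does not by itself make kernels that rely on atomic additions deterministic.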

ptrblck · February 9, 2023, 3:02pm

Could you describe how you’ve narrowed down nn.CrossEntropyLoss as the source of non-determinism?
Using torch.use_deterministic_algorithms(True) and this minimal code snippet does not show any issues:

import torch
import torch.nn as nn

torch.use_deterministic_algorithms(True)
criterion = nn.CrossEntropyLoss()
x = torch.randn(10, 10)
y = torch.randint(0, 10, (10,))
x = x.to("cuda")
y = y.to("cuda")
loss = criterion(x, y)
loss
# tensor(2.7674, device='cuda:0')
loss0 = loss.clone()
loss = criterion(x, y)
loss0 - loss
# tensor(0., device='cuda:0')
loss1 = criterion(x, y)
loss0 - loss1
# tensor(0., device='cuda:0')
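The check above covers only the forward pass. A minimal sketch of the same idea extended to gradients follows; `loss_and_grad` is a hypothetical helper, and the code falls back to the CPU when no GPU is present, in which case it says nothing about CUDA kernels.

```python
import torch
import torch.nn.functional as F

torch.use_deterministic_algorithms(True, warn_only=True)
device = "cuda" if torch.cuda.is_available() else "cpu"


def loss_and_grad(seed: int):
    """Hypothetical helper: one forward/backward pass starting from a fixed seed."""
    torch.manual_seed(seed)
    logits = torch.randn(10, 10, device=device, requires_grad=True)
    target = torch.randint(0, 10, (10,), device=device)
    loss = F.cross_entropy(logits, target)
    loss.backward()
    return loss.detach(), logits.grad.detach()


loss_a, grad_a = loss_and_grad(0)
loss_b, grad_b = loss_and_grad(0)

# torch.equal demands exact elementwise equality, which is stricter than torch.allclose.
same_loss = torch.equal(loss_a, loss_b)
same_grad = torch.equal(grad_a, grad_b)
print(same_loss, same_grad)
```

Swapping in the real model and a real batch turns this into a quick test of whether a single training step is reproducible.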

papoo13 · February 9, 2023, 4:41pm

I use F.cross_entropy() and I get a warning in the beginning that says [torch.nn.NLLLoss] is not deterministic.

ptrblck · February 9, 2023, 8:46pm

That’s interesting as it might be caused by different PyTorch versions.
Could you post a minimal and executable code snippet which would raise the warning as well as your PyTorch version, please?

papoo13 · February 10, 2023, 7:03am

UserWarning: nll_loss2d_forward_out_cuda_template does not have a deterministic implementation, but you set ‘torch.use_deterministic_algorithms(True, warn_only=True)’. You can file an issue at Issues · pytorch/pytorch · GitHub to help us prioritize adding deterministic support for this operation. (Triggered internally at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/Context.cpp:82.)
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
~/anaconda3_new/envs/first_env/lib/python3.9/site-packages/torch/autograd/init.py:173: UserWarning: scatter_add_cuda_kernel does not have a deterministic implementation, but you set ‘torch.use_deterministic_algorithms(True, warn_only=True)’. You can file an issue at Issues · pytorch/pytorch · GitHub to help us prioritize adding deterministic support for this operation. (Triggered internally at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/Context.cpp:82.)

These are the exact warning messages that I get when I run my code.
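One way to find out which call emits such a warning is to escalate it to an exception, so that the traceback points at the offending line. The sketch below uses only the standard `warnings` module. `call_with_determinism_errors` and `fake_training_step` are hypothetical names, and the stand-in step emits a warning with the same wording rather than running a real CUDA kernel.

```python
import warnings


def call_with_determinism_errors(fn, *args, **kwargs):
    """Hypothetical helper: turn determinism warnings raised inside fn into exceptions."""
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "error",
            message=r".*does not have a deterministic implementation",
            category=UserWarning,
        )
        return fn(*args, **kwargs)


def fake_training_step():
    # Stand-in for a real training step; it mimics the wording of the warning above.
    warnings.warn("some_kernel does not have a deterministic implementation", UserWarning)
    return "finished"


try:
    call_with_determinism_errors(fake_training_step)
    escalated = False
except UserWarning as exc:
    escalated = True
    print("raised:", exc)
```

Calling torch.use_deterministic_algorithms(True) without warn_only=True is another route, since PyTorch then raises a RuntimeError at the offending call.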

papoo13 · February 10, 2023, 7:06am

I tried this code but I could not reproduce the error.

import torch
torch.autograd.set_detect_anomaly(True)
torch.use_deterministic_algorithms(True, warn_only=True)
import torch.nn as nn
from torch.nn import functional as F

x = torch.randn(10, 10)
y = torch.randint(0, 10, (10,))
x = x.to("cuda")
y = y.to("cuda")
loss = F.cross_entropy(x, y)
print(loss)

My environment includes:
pytorch 1.12.1 py3.9_cuda11.6_cudnn8.3.2_0 pytorch
cudatoolkit 11.6.0

ptrblck · February 10, 2023, 1:53pm

So my code snippet does not reproduce the warning but yours does? Could you still post your code which raises the warning, please?

papoo13 · February 10, 2023, 4:58pm

So, the code snippet I tried does not reproduce the warning, but when I run my original code I get it.
This is the module with which I calculate the loss. It is a Bayesian neural network used for continual learning, and the loss is the ELBO loss.

from __future__ import absolute_import
from __future__ import print_function
from tkinter import N

import torch
torch.autograd.set_detect_anomaly(True)
torch.use_deterministic_algorithms(True, warn_only=True)
import torch.nn as nn
from torch.nn import functional as F
import pdb


def _accuracy(output, target, topk=(1,)):

    #pdb.set_trace()

    maxk = max(topk)
    batch_size = target.size(0)
    _, pred = output.topk(maxk, 1, True, True)
    pred = pred.t()
    correct = pred.eq(target.view(1, -1))
    res = []
    for k in topk:
        correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
        res.append(correct_k.mul_(100.0 / batch_size))
    return res


class ClassificationLossVI(nn.Module):
    def __init__(self, args, topk=3):
        super(ClassificationLossVI, self).__init__()
        self._topk = tuple(range(1, topk+1))
        self.label_trick = args.label_trick
        self.label_trick_valid = args.label_trick_valid
        self.coreset_training = args.coreset_training
        self.coreset_kld = args.coreset_kld
        self.merged_training = args.merged_training
        
    def forward(self, output_dict, target_dict):
        samples = 1
        prediction_mean = output_dict['prediction_mean'].unsqueeze(dim=2).expand(-1, -1, samples)
        prediction_variance = output_dict['prediction_variance'].unsqueeze(dim=2).expand(-1, -1, samples)
        target = target_dict['target1'] 
        target_expanded = target.unsqueeze(dim=1).expand(-1, samples)
        normal_dist = torch.distributions.normal.Normal(torch.zeros_like(prediction_mean), torch.ones_like(prediction_mean))
        
        if self.training:
          
            losses = {}
            normals =  normal_dist.sample()
            prediction = prediction_mean + torch.sqrt(prediction_variance) * normals 
            
            # -------------------------------------------------------------------------------
            #                                 Labels trick
            # -------------------------------------------------------------------------------
            if self.label_trick is False or self.coreset_kld==1:
           
                # check the dtype of prediction tensor 
                loss = F.cross_entropy(prediction, target_expanded, reduction='mean')
                kl_div = output_dict['kl_div']
                losses['total_loss'] = loss + kl_div()
            
                with torch.no_grad():
                  p = F.softmax(prediction, dim=1).mean(dim=2)
                  losses['xe'] =  F.cross_entropy(prediction, target_expanded, reduction='mean')
                  acc_k = _accuracy(p, target, topk=self._topk)
                  for acc, k in zip(acc_k, self._topk):
                      losses["top%i" % k] = acc
            else:
                
                task_targets = [item -30 for item in target_dict['task_labels']] 
                ordered_task_targets = torch.unique(torch.Tensor(task_targets).long(), sorted=True) 
                if self.merged_training is True:
                    coreset_targets = target_dict['coresets_list'] 
                    if len(coreset_targets)>0:
                        flat_coreset_targets = [item for sublist in coreset_targets for item in sublist] 
                        seen_targets=torch.cat((torch.Tensor(flat_coreset_targets),torch.Tensor(task_targets)), 0)
                        ordered_task_targets = torch.unique(seen_targets, sorted=True).long() 
                        
                # Get the current batch labels (and sort them for reassignment)
                labels = target.clone().detach() 
                for t_idx, t in enumerate(ordered_task_targets):
                    labels[labels==t] = t_idx
         
                labels_expanded = labels.unsqueeze(dim=1).expand(-1, samples)  
                loss_label_trick = F.cross_entropy(prediction[:, ordered_task_targets, :], labels_expanded, reduction='mean')
                kl_div = output_dict['kl_div']
                losses['total_loss'] = loss_label_trick + kl_div()

                with torch.no_grad():
                    p = F.softmax(prediction[:, ordered_task_targets, :], dim=1).mean(dim=2)
                    losses['xe'] =  F.cross_entropy(prediction[:, ordered_task_targets, :], labels_expanded, reduction='mean')
                    acc_k = _accuracy(p, labels, topk=self._topk)
                    for acc, k in zip(acc_k, self._topk):
                        losses["top%i" % k] = acc      
            # ---------------------------------------------------------------------------------------------------
            
        else:
              
              if self.label_trick and self.label_trick_valid: 
                   
                    with torch.no_grad():
                        normals = normal_dist.sample()  
                        prediction = prediction_mean + torch.sqrt(prediction_variance) * normals
                        
                        labels = target.clone().detach() 
                        
                        task_targets = target_dict['task_labels'][0] #shape: [10, 10]
                        ordered_task_targets = torch.unique(task_targets, sorted=True)
                        
                        for t_idx, t in enumerate(ordered_task_targets):
                            labels[labels==t] = t_idx

                        losses = {}
                        kl_div = output_dict['kl_div']
                    
                        p = F.softmax(prediction[:, ordered_task_targets, :], dim=1).mean(dim=2)
                        losses['total_loss'] = - torch.log(p[range(p.shape[0]), labels]).mean() + kl_div()
                        losses['xe'] = - torch.log(p[range(p.shape[0]), labels]).mean()

                        acc_k = _accuracy(p, labels, topk=self._topk)
                        for acc, k in zip(acc_k, self._topk):
                            losses["top%i" % k] = acc
              else: 
                    pdb.set_trace()
                    with torch.no_grad():
                        normals = normal_dist.sample()
                        prediction = prediction_mean + torch.sqrt(prediction_variance) * normals 
                        p = F.softmax(prediction, dim=1).mean(dim=2)
                        losses = {}
                        kl_div = output_dict['kl_div']
                        losses['total_loss'] = - torch.log(p[range(p.shape[0]), target]).mean() + kl_div()
                        losses['xe'] = - torch.log(p[range(p.shape[0]), target]).mean()
                    
                        acc_k = _accuracy(p, target, topk=self._topk)
                        for acc, k in zip(acc_k, self._topk):
                            losses["top%i" % k] = acc
        return losses

    def set_coreset_kld_flag(self, _flag):
        self.coreset_kld=_flag

ptrblck · February 13, 2023, 5:57pm

Thanks for the code. The error is raised in the 2D implementation of nll_loss in these lines of code.
You might be able to avoid this issue by using reduction="none" and applying the reduction explicitly afterwards as seen here:

import torch
import torch.nn as nn

torch.use_deterministic_algorithms(True)
criterion = nn.CrossEntropyLoss()
x = torch.randn(10, 10, 24, 24)
y = torch.randint(0, 10, (10, 24, 24))
x = x.to("cuda")
y = y.to("cuda")
loss = criterion(x, y)
# RuntimeError: nll_loss2d_forward_out_cuda_template does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'

# alternative
criterion = nn.CrossEntropyLoss(reduction="none")
loss = criterion(x, y)
loss = loss.mean()
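In the module posted earlier, the prediction has shape (batch, classes, samples) and the target has shape (batch, samples). Inputs with extra dimensions like these appear to be routed to the 2D nll_loss kernel, which would explain the warning. A sketch of the same workaround on tensors of that shape, run on the CPU with made-up sizes purely to check that both reductions agree:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, classes, samples = 8, 5, 1  # illustrative sizes, not taken from the thread

# Shapes assumed from the posted module: logits (batch, classes, samples), labels (batch, samples).
prediction = torch.randn(batch, classes, samples)
target_expanded = torch.randint(0, classes, (batch, samples))

# Built-in mean reduction, i.e. the call that triggered the warning on CUDA.
loss_mean = F.cross_entropy(prediction, target_expanded, reduction="mean")

# Workaround: compute per-element losses, then reduce explicitly.
per_element = F.cross_entropy(prediction, target_expanded, reduction="none")
loss_manual = per_element.mean()

print(per_element.shape)
print(torch.allclose(loss_mean, loss_manual))
```

Without class weights or ignore_index, the explicit mean should match the built-in one up to floating-point rounding.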

papoo13 · February 14, 2023, 9:40am

Thank you very much! I will try this.

Nirei · December 28, 2023, 5:32pm

I would like to add a note for anyone using this solution in multiclass problems where a certain index must be excluded, and who therefore passes ignore_index to the constructor (e.g. nn.CrossEntropyLoss(ignore_index=1)): be careful to also exclude these indices when calculating the mean.

CrossEntropyLoss will do that for you if you use reduction="mean", but if you use "none" and then apply the mean yourself, you become responsible for excluding them. You may do so like this (assuming your ignored class index is 1):

loss = loss[y != 1].mean()
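A small CPU sketch of the difference, with made-up targets chosen so that two of the six samples fall on the ignored index:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ignore = 1  # ignored class index, as in the post above
x = torch.randn(6, 4)
y = torch.tensor([0, 1, 2, 1, 3, 0])  # two targets equal the ignored index

builtin_mean = nn.CrossEntropyLoss(ignore_index=ignore)(x, y)
per_sample = nn.CrossEntropyLoss(ignore_index=ignore, reduction="none")(x, y)

naive_mean = per_sample.mean()                # divides by all 6 samples
masked_mean = per_sample[y != ignore].mean()  # divides by the 4 kept samples

print(per_sample[y == ignore])  # ignored positions contribute zero loss
print(torch.allclose(masked_mean, builtin_mean))
print(torch.allclose(naive_mean, builtin_mean))
```

The ignored positions come back as zeros under reduction="none", so a plain mean() keeps the numerator but inflates the denominator, which scales the loss down.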