Different Losses on 2 different machines

(Sebastian Raschka) #21

hm, I think I remember seeing RTX2080 specific issues regarding elsewhere (not sure if if was in this discussion forum or elsewhere) and just saw this post from november: https://www.pcbuildersclub.com/en/2018/11/broken-gpus-nvidia-apparently-no-longer-sells-the-rtx-2080-ti/

In any case, it may be that your’s may unfortunately be affected. I’d probably also go with the replacement thing now as this appears not to be an unusual case

(Stepan Ulyanin) #22

Yeah, I read they had problems with dying cores in the very beginning, I am going to post on NVIDIA dev forums tomorrow in search of some kind of utility to check the CUDA cores.

(Sebastian Raschka) #23

just FYI, I just see that there were similar issues here as well

(Stepan Ulyanin) #24

@ptrblck, @rasbt I have a follow up for the issue:

I am running the model on non-faulty GPU and I have noticed that kaiming_normal_ initialization makes my losses go up substantially. However, I had the same behavior of a model with the kaiming_normal_ initialization on my home rig with GTX 1080Ti. Is there a reason why Kaiming normal initialization substantially increases the training loss?

Without kaiming_normal_ initialization:

With kaiming_normal_ initialization:

(Sebastian Raschka) #25

Is there a reason why Kaiming normal initialization substantially increases the training loss?

No, there shouldn’t be a specific reason for Kaiming/He increasing the training loss – personally, I only notices minor differences. Actually, the default Kaiming He (normal) initialization scheme and PyTorch’s default initialization schemes look relatively similar.

Kaiming (normal) uses:


with a=0 by default.

The PyTorch default uses:


when i see that correctly from

def reset_parameters(self):
    stdv = 1. / math.sqrt(self.weight.size(1))
    self.weight.data.uniform_(-stdv, stdv)
    if self.bias is not None:
        self.bias.data.uniform_(-stdv, stdv)

But since the sqrt is in the denominator, it can be much larger. So you probably want to lower your learning rate when using kaiming_normal_. Would be curious to hear what happens if you do that. Maybe choose the learning rate as follows:

learning_rate_before * default_std(fan_in) = new_learning_rate * kaiming(fan_in)

=> earning_rate_before * default_std(fan_in) / kaiming(fan_in) = new_learning_rate

Would be curious to hear what you find…

(Stepan Ulyanin) #26

Thank you for your suggestion, I will try to do this on my simple toy model which I have built yesterday, which has exactly same problem for some reason; I have seen the increase in the training loss for up to x100 times.

(Stepan Ulyanin) #28

The issue was resolved by itself, not sure what happened :confused:

(Marcin Plata) #29

I had a very similar problem with RTX 2080 Ti. I ran the same code on three different GPUs - GTX 1050, TITAN X and RTX 2080Ti. The training process goes fine on GTX 1050 and TITAN X, but on RTX 2080Ti the loss decreases for a few steps and finally raises fast up to a random guessing level (Accuracy achieves a random efficiency). Sometimes I observed NaNs, but not often. In my case, the problem was my own loss function, which I implemented as follow:

def cross_entropy_with_one_hots(input, target):
    logsoftmax = torch.nn.LogSoftmax(dim=1)
    return torch.mean(torch.sum(- target * logsoftmax(input), 1))

Using PyTorch’s torch.nn.CrossEntropyLoss with labels instead one hots, the training process on all three GPUs goes fine. I am guessing the problems is numerical properties of my function, but I am wondering why it works on GTX 1050 and TITAN X, but not on RTX 2080Ti?

Using torch.nn.CrossEntropyLoss solves the problem, but makes impassible mixuping or label smoothing.

(Stepan Ulyanin) #30

In our case it was a faulty GPU that we sent back to the manufacturer for replacement. When the replacement came, the model worked just like it was intended to.