How to debug with floating point differences

I am trying to convert code from an old PyTorch version on an old computer to a new one. I have it very close, but it is still not getting quite the same results. I'm having trouble determining where actual changes are happening and where it is just tiny floating point differences. I have already converted my entire system to .double(), but that actually seems to yield larger problems than without it. Any suggestions?

To check for tiny floating point errors:

def test_close(a, b, eps=1e-5):
    # Works whether a, b are plain scalars or arrays/tensors
    diff = abs(a - b) < eps
    if isinstance(diff, bool):   # plain Python scalars
        return diff
    return bool(diff.all())      # numpy arrays / torch tensors
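
For example (w_old and w_new here are just stand-in tensors; torch.allclose does a similar check, with a relative tolerance as well):

import torch

w_old = torch.randn(3, 3)
w_new = w_old + 1e-7 * torch.randn(3, 3)  # stand-ins for the same weight from the two versions
print(test_close(w_old, w_new))           # True: every element is within eps
print(test_close(1.0, 1.0 + 1e-4))        # False: scalar case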

That finds the differences; the problem is I don't know when the differences are due to floating point rounding and when they are due to actual functionality differences.

Hi Py!

First, don’t look at differences in your results after training for a
number of (batch) iterations. It is perfectly possible for your
model parameters to wander off in distinctly different, although
broadly equivalent, directions once accumulated round-off
errors start them down different paths.

You should look at whether your differences are compatible with
round-off error after passing a single batch through your model,
checking your output, your loss, and your gradients after
backpropagating once. If you still think these differences exceed
reasonable, somewhat accumulated round-off error, trace
intermediate steps through the layers of your model. Unless you
have an ill-conditioned Linear in there somewhere, no single
layer should give results that differ by more than a few times
round-off error.
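
(Here is a minimal sketch of that kind of single-batch check; the toy Linear
model and the file name are just placeholders for your own setup, and it
assumes a file saved by the old version can be loaded by the new one:)

import torch

# toy stand-ins; in practice use your own model, loss, and one fixed batch
torch.manual_seed(0)
model = torch.nn.Linear(4, 2)
criterion = torch.nn.CrossEntropyLoss()
inputs = torch.randn(8, 4)
targets = torch.randint(0, 2, (8,))

out = model(inputs)
loss = criterion(out, targets)
loss.backward()

# in the old version: save output, loss, and gradients after this single batch
torch.save({'out': out.detach(),
            'loss': loss.detach(),
            'grads': [p.grad.clone() for p in model.parameters()]},
           'single_batch_reference.pt')

# in the new version: load the reference and look at the largest differences
ref = torch.load('single_batch_reference.pt')
print((out.detach() - ref['out']).abs().max().item())
print((loss.detach() - ref['loss']).abs().item())
for g, g_ref in zip([p.grad for p in model.parameters()], ref['grads']):
    print((g - g_ref).abs().max().item())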

The next general check you can make is to move your model and
all of your data to double-precision. Round-off error in your results,
even after accumulating somewhat, should be reduced such that
you have about eight or nine additional decimal digits of precision.
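
(Something along these lines, as a toy sketch with a stand-in Linear layer;
note that the batches coming out of your dataloader have to be converted
too, not just the model:)

import torch

model = torch.nn.Linear(4, 2).double()   # casts all parameters and buffers to float64
inputs = torch.randn(8, 4).double()      # floating-point inputs must be cast as well
targets = torch.randint(0, 2, (8,))      # integer class labels stay as they are
loss = torch.nn.functional.cross_entropy(model(inputs), targets)
print(loss.dtype)                        # torch.float64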

You comment that converting your “entire system to .double() …
seems to yield larger problems than without it.” Is this after
training multiple iterations after your parameters may have started
wandering down different paths? Or is this still the case after only
a single forward and/or backward pass through your system?

Best.

K. Frank

Hi K. Frank!

I am looking at values after single batches. To try to fix this for the more complicated project, I am using a model that is very simple: just a conv layer, a sigmoid, and another conv. I am printing the exact weights and biases as well as the grad for every weight after each batch, and ensuring they initialize to the same values. It gives the same results up to 16 decimal places for the first few batches, but then drift starts to accumulate. With double precision the drift started immediately. Maybe I am not using .double() in the correct places? I basically just changed everywhere I was doing .cuda() to now be .cuda().double().

Hi Py!

This is strong evidence that your two versions are doing the same
thing, up to floating-point round-off. You can’t expect any more
than this.

(As an aside, 16 decimal places is what you would expect for
double precision. So you’re actually doing better than you would
expect for single precision. This behavior is actually a bit odd,
but I can cheerfully concoct some plausible explanations for it.)
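
(For reference, the machine epsilon of the two precisions sets the scale of
relative round-off you can expect from a single operation:)

import torch

print(torch.finfo(torch.float32).eps)   # ~1.19e-07, roughly 7 decimal digits
print(torch.finfo(torch.float64).eps)   # ~2.22e-16, roughly 15 to 16 decimal digits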

Whether this is a concern depends on the details of what you
mean by this. After a single forward or backward pass, do your
double-precision results agree to significantly less than 16 decimal
digits? If not, then you’re getting as much as you have a right
to get.

Best.

K. Frank

Yes, after a single forward pass I get results that agree to fewer than 16 digits. I am not initializing to different random weights or passing in data in a different order. I print out every weight in my network, run the outputs through the diff tool Meld, and get exactly the same values up to 16 digits. But then on the very first forward pass I get losses of 1.9986734118816107 and 1.9986734009655582. When I was only using float precision things did always stay exactly the same for a few batches before the drift started. I get what you are saying in the other thread that this will happen and they will converge to equally correct minima. But I am only debugging with this very simple model; the one I am actually trying to correct once I figure this out is very complicated and uses lots of funky custom functions. The evidence that something is wrong is that with the real model everything works reliably in PyTorch 0.3 and does not work at all in 1.3.1.
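
Just to put a number on how far apart those two losses are:

a = 1.9986734118816107
b = 1.9986734009655582
print(abs(a - b))           # ~1.1e-08 absolute difference
print(abs(a - b) / abs(a))  # ~5.5e-09 relative difference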

Hi Py!

If I understand correctly, this happens when you are running everything
(as far as you know) in double precision.

This is odd. Your two results agree up to single precision, so it’s not
as if one version is completely broken. It’s as if one version does
something only in single precision, even though it should be in double
precision.

Do I understand correctly (based on your earlier post) that this means
that the two versions agree (for a few batches) up to double precision,
even though you are running in single precision?

This is also odd (although not “wrong” or impossible). Am I right
that you are saying that when running in single precision, the two
versions agree up to double precision, but when running in double
precision, the two versions only agree up to single precision (at least
for the first batch)?

Could you post a simple, complete, runnable script that illustrates
this issue, together with its output?

What does “does not work at all in 1.3.1” mean? It sounds like (in
single precision) 1.3.1 agrees with 0.3 up to double precision. So
that sounds like it’s working. (And in double precision, they agree
up to single precision, so it’s more-or-less working.)

Also, could you tell us the exact version number of your 0.3
installation? (You can run print(torch.__version__).)

Best.

K. Frank

That is the correct understanding of the two numbers.

That is also correct: with float precision the two versions agreed for a few batches up to 32 decimal places when I was not using double precision. And yes, when I switch to double it seems that they agree to 32 decimal places when I specify exact values, but then as soon as any operations are done (forward pass, backward grad, or loss calculations) they disagree immediately.

Does not work at all in 1.3.1 means that the built-in PyTorch functions work properly and it will converge to a slightly different local minimum without my custom functions, but when it uses my custom functions they make no difference at all, while they did in 0.3. The old version is 0.3.0.post4.

It looks like in 1.3.1 it actually is not giving consistent values to begin with. When I run the same code initialized with the same values, I can print the network weights, network output, and loss value and they are exactly the same to 16 decimal places, but the actual grads that get calculated only agree up to 7 digits. I can try to make a minimal example to show code if that would help. My understanding is that floats might lose some precision, but the same math on the same system should always do the same rounding and give the same values.

Here is what I’m doing for debugging now; the second grad value changes every time past the first 7 digits, while the first, manually calculated grad is identical every time to all digits printed:

(Pdb) X = outputs
(Pdb) y = targets
(Pdb) m = y.shape[0]
(Pdb) grad = F.softmax(X,1)
(Pdb) grad[range(m),y] -= 1
(Pdb) grad = grad/m
(Pdb) ((net.layer1.module.layer1(inputs)).sigmoid()[:,0,0,0] * grad[:,0]).sum().tolist()
0.03555078059434891
(Pdb) (net.layer2.module.layer1.weight.grad[0][0][0][0]).tolist()
0.03555076941847801

Edit: I thought I noticed a pattern while running this many times and decided to check. It turns out that after 225 trials this second value only ever takes on 8 different values. Is it possible that somewhere in the PyTorch code for backpropagating cross entropy loss there are 3 bits that are not being set to any particular value, so they get randomly assigned?

Edit 2: This seems to depend on batch size. The 8 values were with a batch size of 128; with batch size 64 there are only 7 values, with batch size 16 it is down to 3, and batch sizes 4, 2, and 1 consistently return the same correct value.
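
Roughly how I am counting the distinct values, as a sketch (run_one_batch() is just a stand-in for re-running the setup above with the same initialization and the same batch and returning the autograd grad value):

seen = set()
for trial in range(225):
    value = run_one_batch()   # hypothetical helper: same init, same batch, returns the autograd grad value
    seen.add(value)
print(len(seen))              # 8 distinct values with batch size 128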

Hi Py!

If I understand properly what you are saying, it sounds like you
have this issue only when you are using custom functions and
you are running on 1.3.1.

This suggests that there is a bug in 1.3.1 that is uncovered
by your custom functions. Or, I suppose, there could be a bug
in your custom functions that 0.3.0 is letting you get away with.

Could you post the code for your custom functions?

A minimal example would be the best thing. A complete, runnable
script (with all necessary imports, hard-coded or randomly generated
sample, and anything else needed so that it can be run just by
copy-pasting it from the forum) would be very helpful.

This makes it sound like the bug – whether yours, or in 1.3.1 – is
associated with the .backward() of one of your custom functions.

A minimal example – that will presumably contain the code for
your custom functions – would be the best next step. Please
also post the output produced by your minimal example, when
run on both 0.3.0 and 1.3.1.

Also, I’m not saying that you should necessarily do this, but have
you tried upgrading to pytorch 1.5 (or maybe 1.4) to see if this
issue still persists?

[Edit:] One possibility: are you using .data in your custom
functions? .data is deprecated since 0.4.0, and, although
still present, doesn’t work correctly for all use cases. Please
see the PyTorch 0.4.0 Migration Guide for details.
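
(For example, this is roughly the kind of change the migration guide
describes; a sketch with a placeholder parameter, not your actual code:)

import torch

param = torch.nn.Parameter(torch.randn(3))   # placeholder parameter

# 0.3-era style: .data gives a tensor that autograd does not track,
# and in-place changes to it can silently give wrong gradients
w = param.data

# 0.4+ style: .detach() also returns an untracked view, but autograd
# will complain if an in-place change would corrupt a recorded gradient
w = param.detach()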

Best.

K. Frank

I gave the Migration Guide a read. Thanks for sending me that! I do suspect that once I get to the bottom of this, something from that guide will turn out to be what I did wrong, since my custom functions were using .data in both the 0.3 and 1.3 versions. However, when making the minimal example, my custom functions were not needed to replicate this problem. The following code also shows that the hand-calculated grad does not match the autograd-calculated one in 1.3.1. I made similar code in 0.3.0 and it does not match there either.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

import torchvision.transforms as transforms
from torchvision import datasets

class Small_Net(nn.Module):
    def __init__(self, device):
        super(Small_Net, self).__init__()
        self.layer1 = nn.Conv2d(1, 40, kernel_size=28, stride=1, padding=0, bias=True)
        self.layer2 = nn.Conv2d(40, 10, kernel_size=1, stride=1, padding=0, bias=True)
        self.layer1.to(device)
        self.layer2.to(device)

    def forward(self, x):        
        out2 = self.layer1(x).sigmoid()
        self.midOut = out2.detach().clone()
        out3 = self.layer2(out2)            
        out6 = out3.view(out3.size(0), -1)
        return out6

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
net = Small_Net(device)  
net.to(device)
net = torch.nn.DataParallel(net, device_ids=range(torch.cuda.device_count())).to(device)
start_epoch = 0
trainloader = torch.utils.data.DataLoader(datasets.MNIST('../mnist_data', 
                                                          download=True, 
                                                          train=True,
                                                          transform=transforms.Compose([
                                                              transforms.ToTensor(), # first, convert image to PyTorch tensor
                                                              transforms.Normalize((0.1307,), (0.3081,)) # normalize inputs
                                                          ])), 
                                           batch_size=128, 
                                           shuffle=True)
criterion = nn.CrossEntropyLoss()             
def train(epoch, mode):
    net.train()
    optimizer = optim.SGD(filter(lambda p: p.requires_grad, net.parameters()), lr=0.1, momentum=0, weight_decay=0)  # 5e-4)
    for batch_idx, (inputs, targets) in enumerate(trainloader):
        batchSize, outputMaps, x, y = inputs.size()
        inputs, targets = inputs.to(device), targets.to(device)  # GPU settings        
        optimizer.zero_grad()
        output = net(inputs)
        loss = criterion(output, targets)
        loss.backward()    
        # Hand-compute dL/dlogits for cross entropy: (softmax(X) - one_hot(y)) / batch_size
        X = output
        y = targets
        m = y.shape[0]
        grad = F.softmax(X, 1)
        grad[range(m), y] -= 1
        grad = grad / m
        print('is same?')
        # Hand-computed grad of layer2.weight[0, 0, 0, 0]: sum over the batch of
        # (sigmoid output of layer1, channel 0) * dL/dlogit_0, computed two ways
        print((net.module.layer1(inputs).sigmoid()[:, 0, 0, 0] * grad[:, 0]).sum().tolist())
        print((net.module.midOut[:, 0, 0, 0] * grad[:, 0]).sum().tolist())
        # The same gradient as computed by autograd
        print((net.module.layer2.weight.grad[0][0][0][0]).tolist())
        del inputs, targets, output
        optimizer.step() 
    
train(0, 'n')

Example output:

is same?
-0.0011598279234021902
-0.0011598279234021902
-0.0011598295532166958

Edit: with double precision they do agree much more closely, but still not exactly. Is this just the best that can be expected because of what autograd is doing under the hood?

is same?
-0.00020605623917282047
-0.00020605623917282047
-0.00020605623917282427

Hi Py!

My conclusion is that your use of .data is the cause of “does
not work at all in 1.3.1.”

Note, what you show below does not replicate the "does not work at
all " problem – it replicates the “agrees up to expected floating-point
round-off” so-called problem.

I assume that this is running on 1.3.1 with (the default) single
precision (.float()). Note that your third result (from autograd)
agrees up to single-precision round-off with the first two. This is
really the best you can expect.

Now I assume this is 1.3.1 with double precision (.double()).
Your results still agree up to round-off error, but this time up to the
more accurate double precision.

Yes, this is the best you can expect. It should not be viewed as
something being wrong, or any kind of issue (other than that’s
how floating-point arithmetic works).

I wouldn’t say that this is “because of … autograd.” You do the
gradient calculation twice – once your way, and once autograd’s
way. Neither one is more right or wrong than the other. They
just happen to perform certain calculations in different orders.
The different orders are mathematically equivalent, but differ
when using floating-point arithmetic. So you get differences at
the level of floating-point round-off.

But, again, it’s not autograd, per se. You could, yourself, calculate
the gradient twice, but purposely reorder some of the calculations
in a mathematically equivalent way, and you would see similar
differences.
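
(A tiny illustration of this, with no autograd involved: the same three
numbers summed in two mathematically equivalent orders give different
double-precision results.)

print((0.1 + 0.2) + 0.3)   # 0.6000000000000001
print(0.1 + (0.2 + 0.3))   # 0.6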

Best.

K. Frank

That all makes sense. I just would have thought that doing the same calculations (e.g., two runs of autograd) with the same numbers would at least always do the same rounding. Now that I am using .double() correctly, I was able to create a test that is repeatable to 8 decimal places over my training epochs. I will use the migration guide to redo the changes I made to the custom functions and be especially careful about .data. Thanks so much for all the help!