I am trying to port code from an old PyTorch version on an old computer to a new one. I have it very close, but it is still not producing quite the same results. I'm having trouble telling where actual behavior changes are happening and where it is just tiny floating-point differences. I have already converted my entire system to .double(), but that actually seems to cause larger discrepancies than without it. Any suggestions?

To check for tiny floating-point errors:

```
def test_close(a, b, eps=1e-5):
    # Works for Python scalars as well as arrays / tensors
    close = abs(a - b) < eps
    # Arrays and tensors need .all(); a plain bool does not have it
    return bool(close.all()) if hasattr(close, "all") else close
```

That finds the differences. The problem is that I don't know when the differences are due to floating-point rounding and when they reflect actual differences in functionality.

Hi Py!

First, don't look at differences in your results after training for a number of (batch) iterations. It is perfectly possible for your model parameters to wander off in distinctly different, although content-wise broadly equivalent, directions after accumulated round-off errors start them down different paths.

You should look at whether your differences are compatible with round-off error after passing a single batch through your model, checking your output, your loss, and your gradients after backpropagating once. If you still think these differences exceed reasonable, somewhat accumulated round-off error, trace intermediate steps through the layers of your model. Unless you have an ill-conditioned `Linear` in there somewhere, no single layer should give results that differ by more than a few times round-off error.
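
One way to make "a few times round-off error" concrete is to express the difference between two tensors in units of the dtype's machine epsilon. The following is only a minimal sketch: the helper name `diff_in_eps` and the choice to scale by the largest magnitude in either tensor are assumptions made for illustration, not something prescribed in this thread.

```python
import torch

def diff_in_eps(a: torch.Tensor, b: torch.Tensor) -> float:
    """Largest |a - b|, relative to the data's magnitude, in units of machine epsilon."""
    info = torch.finfo(a.dtype)
    # Scale by the largest magnitude present (guarding against all-zero tensors)
    scale = torch.max(a.abs().max(), b.abs().max()).clamp_min(info.tiny)
    return ((a - b).abs().max() / (scale * info.eps)).item()

torch.manual_seed(0)
x = torch.randn(1000)

# Two mathematically equivalent ways of computing 0.3 * x
a = (x * 0.1) * 3.0
b = x * 0.3

print("equivalent math:", diff_in_eps(a, b))              # a handful of eps at most
print("real change:    ", diff_in_eps(a, a * (1 + 1e-3)))  # thousands of eps
```

A value of order 1 to 10 is what round-off looks like; values in the thousands or more suggest that the two computations are not actually doing the same thing.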

The next general check you can make is to move your model and all of your data to double precision. Round-off error in your results, even after accumulating somewhat, should be reduced such that you have about eight or nine additional decimal digits of precision.
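
The "eight or nine additional decimal digits" figure follows directly from the machine epsilons of the two types. A short sketch using `torch.finfo` (the variable names are illustrative):

```python
import math
import torch

eps32 = torch.finfo(torch.float32).eps  # 2**-23, about 1.19e-07
eps64 = torch.finfo(torch.float64).eps  # 2**-52, about 2.22e-16

# Ratio of the two epsilons, expressed as a number of decimal digits
extra_digits = math.log10(eps32 / eps64)

print(f"float32 eps: {eps32:.3e}")
print(f"float64 eps: {eps64:.3e}")
print(f"extra decimal digits in double precision: {extra_digits:.2f}")
```

Single precision therefore carries roughly 7 significant decimal digits and double precision roughly 16.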

You comment that converting your "entire system to .double() ... seems to yield larger problems than without it." Is this after training for multiple iterations, when your parameters may have started wandering down different paths? Or is this still the case after only a single forward and/or backward pass through your system?

Best.

K. Frank

Hi K. Frank!

I am looking at values after single batches. To debug this before tackling the more complicated project, I am using a very simple model: it's just a conv layer, a sigmoid, and another conv. I am printing the exact weights and biases as well as the grad for every weight after each batch, and ensuring they initialize to the same values. The two versions give the same values up to 16 decimal places for the first few batches, but then drift starts to accumulate. With double precision the drift started immediately. Maybe I am not using .double() in the correct places? I basically just changed everywhere I was doing .cuda() to .cuda().double().
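
One way to check whether `.double()` has reached everything is to list any floating-point parameter, buffer, or input that is still not `float64`. This is only a sketch: `non_double_tensors` is a hypothetical helper, and the tiny model stands in for the conv, sigmoid, conv network described above.

```python
import torch
import torch.nn as nn

def non_double_tensors(model: nn.Module, **tensors: torch.Tensor) -> list:
    """Names of floating-point parameters, buffers, and extra tensors that are not float64."""
    named = list(model.named_parameters()) + list(model.named_buffers()) + list(tensors.items())
    return [name for name, t in named if t.is_floating_point() and t.dtype != torch.float64]

# Stand-in for the simple conv -> sigmoid -> conv model
model = nn.Sequential(nn.Conv2d(1, 4, 3), nn.Sigmoid(), nn.Conv2d(4, 2, 1))
inputs = torch.randn(8, 1, 5, 5)

print("before:", non_double_tensors(model, inputs=inputs))  # weights, biases, inputs

model = model.double()    # converts all parameters and buffers
inputs = inputs.double()  # input tensors must be converted separately

print("after: ", non_double_tensors(model, inputs=inputs))  # []
```

Tensors created inside custom functions (for example with `torch.zeros(...)` and no `dtype` argument) default to `float32`, so they are worth passing to a check like this as well.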

Hi Py!

Agreement to 16 decimal places for the first few batches is strong evidence that your two versions are doing the same thing, up to floating-point round-off. You can't expect any more than this.

(As an aside, 16 decimal places is what you would expect for double precision. So you're actually doing better than you would expect for single precision. This behavior is actually a bit odd, but I can cheerfully concoct some plausible explanations for it.)

Whether the immediate drift in double precision is a concern depends on the details of what you mean by it. After a *single* forward or backward pass, do your double-precision results agree to significantly fewer than 16 decimal digits? If not, then you're getting as much as you have any right to get.

Best.

K. Frank

Yes, after a single forward pass I get results that agree to fewer than 16 digits. I am not initializing to different random weights or passing in data in a different order. I print out every weight in my network, run the two outputs through the diff tool Meld, and get exactly the same values up to 16 digits. But then on the very first forward pass I get losses of 1.9986734118816107 and 1.9986734009655582. When I was only using float precision, things always stayed exactly the same for a few batches before the drift started. I get what you are saying in the other thread, that this will happen and the two runs will converge to equally correct minima. But I am only debugging with this very simple model; the one I am actually trying to correct once I figure this out is very complicated and uses lots of funky custom functions. The evidence that something is wrong is that with the real model everything works reliably in PyTorch 0.3 and does not work at all in 1.3.1.

Hi Py!

If I understand correctly, this happens when you are running everything (as far as you know) in double precision.

This is odd. Your two results agree up to single precision, so it's not as if one version is *completely* broken. It's as if one version does something only in single precision, even though it should be in double precision.

Do I understand correctly (based on your earlier post) that the two versions agree (for a few batches) up to *double* precision, even though you are running in single precision?

This is also odd (although not "wrong" or impossible). Am I right that you are saying that when running in single precision, the two versions agree up to double precision, but when running in double precision, the two versions only agree up to single precision (at least for the first batch)?

Could you post a simple, complete, runnable script that illustrates this issue, together with its output?

What does "does not work at all in 1.3.1" mean? It sounds like (in single precision) 1.3.1 agrees with 0.3 up to double precision. So that sounds like it's working. (And in double precision, they agree up to single precision, so it's more-or-less working.)

Also, could you tell us the exact version of your 0.3.0 install? (You can run `print(torch.__version__)`.)

Best.

K. Frank

That is the correct understanding of the two numbers.

That is also correct. With float precision, the two versions agreed for a few batches up to 32 decimal places. And yes, when I switch to double, they seem to agree to 32 decimal places when I specify exact values, but as soon as any operations are done (forward pass, backward grad, or loss calculation), they disagree immediately.

"Does not work at all in 1.3.1" means that the built-in PyTorch functions work properly, and training converges to a slightly different local minimum without my custom functions, but when it uses my custom functions they make no difference at all, while they used to in 0.3. The old version is 0.3.0.post4.

It looks like in 1.3.1 it actually is not giving consistent values to begin with. When I run the same code initialized with the same values I can print the network weights, network output, loss value all to be exactly the same to 16 decimal points but then actual grads that get calculated only agree up to 7. I can try to make a minimal example to show code if that would help. My understanding is that floats might lose some precision, but the same math on the same system should always do the same rounding and give the same values.

Here is what I'm doing for debugging now. The second grad value changes on every run past the first 7 digits, while the first, manually calculated grad is identical on every run to all digits printed:

```
(Pdb) X = outputs
(Pdb) y = targets
(Pdb) m = y.shape[0]
(Pdb) grad = F.softmax(X,1)
(Pdb) grad[range(m),y] -= 1
(Pdb) grad = grad/m
(Pdb) ((net.layer1.module.layer1(inputs)).sigmoid()[:,0,0,0] * grad[:,0]).sum().tolist()
0.03555078059434891
(Pdb) (net.layer2.module.layer1.weight.grad[0][0][0][0]).tolist()
0.03555076941847801
```

Edit: I noticed a pattern while running this many times and decided to check it. It turns out that over 225 trials this second value only ever takes on 8 different values. Is it possible that somewhere in the PyTorch code for backpropagating cross-entropy loss there are 3 bits that are not being set to any particular value, so they get randomly assigned?

Edit 2: This seems to depend on batch size. The 8 values were with a batch size of 128; with batch size 64 there are only 7 values, and with batch size 16 it is down to 3. Batch sizes 4, 2, and 1 consistently return the same correct value.

Hi Py!

If I understand properly what you are saying, it sounds like you have this issue only when you are using custom functions and you are running on 1.3.1.

This suggests that there is a bug in 1.3.1 that is uncovered by your custom functions. Or, I suppose, there could be a bug in your custom functions that 0.3.0 is letting you get away with.

Could you post the code for your custom functions?

A minimal example would be the best thing. A complete, runnable script (with all necessary imports, hard-coded or randomly generated sample data, and anything else needed so that it can be run just by copy-pasting it from the forum) would be very helpful.

This makes it sound like the bug (whether yours, or in 1.3.1) is associated with the `.backward()` of one of your custom functions.

A minimal example (that will presumably contain the code for your custom functions) would be the best next step. Please also post the output produced by your minimal example, when run on both 0.3.0 and 1.3.1.

Also, I'm not saying that you should necessarily do this, but have you tried upgrading to PyTorch 1.5 (or maybe 1.4) to see if this issue still persists?

[Edit:] One possibility: Are you using `.data` in your custom functions? `.data` is deprecated since 0.4.0 and, although still present, doesn't work correctly for all use cases. Please see the PyTorch 0.4.0 Migration Guide for details.
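
The kind of failure the Migration Guide warns about can be sketched as follows. This is a toy illustration in the spirit of the guide's own example, not the custom functions from this thread: an in-place edit made through `.data` is invisible to autograd and silently yields a wrong gradient, while the same edit made through `.detach()` is detected when `backward()` runs.

```python
import torch

# Unsafe: the in-place edit through .data is not tracked by autograd
a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = a.sigmoid()
out.data.zero_()          # overwrites the values sigmoid's backward relies on
out.sum().backward()      # no error is raised
wrong_grad = a.grad.clone()
print("grad via .data:", wrong_grad)  # all zeros, silently wrong

# Safer: the same edit through .detach() shares version tracking with `out`
b = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = b.sigmoid()
out.detach().zero_()
try:
    out.sum().backward()
    detected = False
except RuntimeError as err:
    detected = True
    print("autograd caught the in-place edit:", type(err).__name__)
```

Code that behaved as intended under 0.3 can fall into the first pattern after porting, which would be consistent with custom functions that appear to "make no difference at all".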

Best.

K. Frank

I gave the Migration Guide a read. Thanks for sending me that! I do suspect once I get to the bottom of this something from that guide will be what I did wrong since my custom functions were using .data in 0.3 and 1.3 versions. However, when making the minimal example my custom functions were not needed to replicate this problem. The following code also shows that the hand calculated grad does not match up with the autograd calculated one in 1.3.1. I made similar code in 0.3.0 and it also does not match there.

```
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.transforms as transforms
from torchvision import datasets


class Small_Net(nn.Module):
    def __init__(self, device):
        super(Small_Net, self).__init__()
        self.layer1 = nn.Conv2d(1, 40, kernel_size=28, stride=1, padding=0, bias=True)
        self.layer2 = nn.Conv2d(40, 10, kernel_size=1, stride=1, padding=0, bias=True)
        self.layer1.to(device)
        self.layer2.to(device)

    def forward(self, x):
        out2 = self.layer1(x).sigmoid()
        self.midOut = out2.detach().clone()
        out3 = self.layer2(out2)
        out6 = out3.view(out3.size(0), -1)
        return out6


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
net = Small_Net(device)
net.to(device)
net = torch.nn.DataParallel(net, device_ids=range(torch.cuda.device_count())).to(device)
start_epoch = 0
trainloader = torch.utils.data.DataLoader(
    datasets.MNIST('../mnist_data',
                   download=True,
                   train=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),  # first, convert image to PyTorch tensor
                       transforms.Normalize((0.1307,), (0.3081,))  # normalize inputs
                   ])),
    batch_size=128,
    shuffle=True)
criterion = nn.CrossEntropyLoss()


def train(epoch, mode):
    net.train()
    optimizer = optim.SGD(filter(lambda p: p.requires_grad, net.parameters()), lr=0.1, momentum=0, weight_decay=0)  # 5e-4)
    for batch_idx, (inputs, targets) in enumerate(trainloader):
        batchSize, outputMaps, x, y = inputs.size()
        inputs, targets = inputs.to(device), targets.to(device)  # GPU settings
        optimizer.zero_grad()
        output = net(inputs)
        loss = criterion(output, targets)
        loss.backward()
        X = output
        y = targets
        m = y.shape[0]
        grad = F.softmax(X, 1)
        grad[range(m), y] -= 1
        grad = grad / m
        print('is same?')
        print(str((net.module.layer1(inputs).sigmoid()[:, 0, 0, 0] * grad[:, 0]).sum().tolist()))
        print((net.module.midOut[:, 0, 0, 0] * grad[:, 0]).sum().tolist())
        print(str((net.module.layer2.weight.grad[0][0][0][0]).tolist()))
        del inputs, targets, output
        optimizer.step()


train(0, 'n')
```

Example output:

is same?

-0.0011598279234021902

-0.0011598279234021902

-0.0011598295532166958

Edit: with double precision it does agree much more but still not the same. Is this just the best that can be expected because of what autograd is doing under the hood?

is same?

-0.00020605623917282047

-0.00020605623917282047

-0.00020605623917282427

Hi Py!

My conclusion is that your use of `.data` is the cause of "does not work at all in 1.3.1."

Note that the example you posted does not replicate the "does not work at all" problem. It replicates the "agrees up to expected floating-point round-off" so-called problem.

I assume that your first example output is from running on 1.3.1 with (the default) single precision (`.float()`). Note that your third result (from autograd) agrees with the first two up to single-precision round-off. This is really the best you can expect.

I assume that your second output is from 1.3.1 with double precision (`.double()`). Your results still agree up to round-off error, but this time up to the more accurate double precision.

Yes, this is the best you can expect. It should not be viewed as something being wrong, or any kind of issue (other than that's how floating-point arithmetic works).
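
The same comparison can be reproduced without MNIST or a GPU. The sketch below is an assumption-laden stand-in for the posted script: it uses random inputs on the CPU, keeps the layer shapes from the example, and reports the hand-versus-autograd mismatch relative to the sum of the absolute values of the terms being added (a choice made here to avoid dividing by a nearly cancelled sum). The helper name `grad_mismatch` is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_mismatch(dtype: torch.dtype) -> float:
    """Relative gap between a hand-computed and an autograd gradient entry."""
    torch.manual_seed(0)
    layer1 = nn.Conv2d(1, 40, kernel_size=28).to(dtype)
    layer2 = nn.Conv2d(40, 10, kernel_size=1).to(dtype)
    inputs = torch.randn(128, 1, 28, 28, dtype=dtype)  # random stand-in for MNIST
    targets = torch.randint(0, 10, (128,))

    mid = layer1(inputs).sigmoid()
    output = layer2(mid).view(inputs.size(0), -1)
    F.cross_entropy(output, targets).backward()
    autograd_grad = layer2.weight.grad[0, 0, 0, 0]

    with torch.no_grad():
        m = targets.shape[0]
        dlogits = F.softmax(output, 1)          # d(loss)/d(logits) = (softmax - onehot) / m
        dlogits[torch.arange(m), targets] -= 1
        dlogits /= m
        terms = mid[:, 0, 0, 0] * dlogits[:, 0]
        manual_grad = terms.sum()
        scale = terms.abs().sum()

    return ((autograd_grad - manual_grad).abs() / scale).item()

print("float32 mismatch:", grad_mismatch(torch.float32))
print("float64 mismatch:", grad_mismatch(torch.float64))
```

The mismatch should sit near the machine epsilon of whichever dtype is in use, which is the pattern seen in the two example outputs above.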

I wouldn't say that this is "because of ... autograd." You do the gradient calculation twice: once your way, and once autograd's way. Neither one is more right or wrong than the other. They just happen to perform certain calculations in different orders. The different orders are *mathematically* equivalent, but differ when using floating-point arithmetic. So you get differences at the level of floating-point round-off.

But, again, it's not autograd, per se. You could, yourself, calculate the gradient twice, but purposely reorder some of the calculations in a *mathematically* equivalent way, and you would see similar differences.
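
The effect of reordering can be seen without PyTorch at all. A small sketch in plain Python (double precision); the helper `running_sum` and the random test values are illustrative only:

```python
import random
import sys

# Floating-point addition is not associative
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left, right, left == right)

def running_sum(values):
    """Plain left-to-right accumulation, one rounding per addition."""
    total = 0.0
    for v in values:
        total += v
    return total

random.seed(0)
values = [random.uniform(-1.0, 1.0) for _ in range(10_000)]

forward = running_sum(values)
backward = running_sum(reversed(values))

# Same numbers, different order: any gap is pure round-off
print("forward :", forward)
print("backward:", backward)
print("gap     :", abs(forward - backward))
```

Parallel reductions on a GPU may add terms in an order that varies from run to run, which is one plausible reason (not confirmed in this thread) that repeated runs of the same backward pass can land on a small set of slightly different values.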

Best.

K. Frank

That all makes sense. I just would have thought that at least doing the same calculations, e.g. two runs of autograd, with the same numbers would always do the same rounding. Now that I am using double() correctly I was able to create a test that is repeatable to 8 decimal places over my training epochs. I will use the migration guide to redo the changes I made with the custom functions and be especially careful about .data(). Thanks so much for all the help!