Reproduce Training Question

Dear Community,

I’m trying to reproduce a simple over-fitting experiment on autoencoders. I have matching initialization under control, but the loss and the results diverge over time.

Focusing on the first 10 steps, we can see that the two runs are initialized the same.

Then, as training goes further, the curves diverge a lot more.

I found multiple discussions on reproducibility and determinism here already, for example [1], [2], [3], [4].

From the documentation we have Reproducibility [4], Multiprocessing best practices [5], and Asynchronous execution [6].

I have followed all the suggestions and confirmed that I get matching random numbers and matching initialized weights (I diff all the weights and they match 100%) in my modules between tests. But my results always end up different.
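For reference, the weight check is roughly the following sketch; the checkpoint file names are just placeholders for a state_dict saved right after initialization in each run.

import torch

# Placeholder paths: a state_dict saved right after initialization in each run.
state_a = torch.load("run_a_init.pth")
state_b = torch.load("run_b_init.pth")

for name in state_a:
    if not torch.equal(state_a[name], state_b[name]):
        print("mismatch in", name)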

Combining all the previous discussions, here is what I did at the beginning of my main.py:

import numpy
numpy.random.seed(123)
import random
random.seed(123)

import torch
import torch.nn as nn
import torch.optim as optim
import torch.backends.cudnn as cudnn
from torch.autograd import Variable

# rule out cuDNN entirely, so neither its non-deterministic kernels
# nor the benchmark autotuner can change the selected algorithms
torch.backends.cudnn.enabled = False
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# seed the CPU and the current-GPU random number generators
torch.manual_seed(123)
torch.cuda.manual_seed(123)

In the data loader, num_workers=0. There are only two training samples, so there is no shuffling, and flipping (augmentation) is turned off.
In the terminal, I run CUDA_LAUNCH_BLOCKING=1 python ./main.py.
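For completeness, the data loading part looks roughly like this sketch; the random tensors only stand in for my two training images.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Two fixed samples, loaded in the main process, no shuffling, no augmentation.
# The random tensors are placeholders for the actual training images.
train_dataset = TensorDataset(torch.randn(2, 3, 64, 64))
train_loader = DataLoader(
    train_dataset,
    batch_size=2,
    shuffle=False,   # only two samples, so no shuffling
    num_workers=0,   # keep loading in the main process to avoid worker randomness
)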

Is there any other non-deterministic behavior during convolution or back-propagation that we need to consider?

Thank you!

Just tried replacing Adam with SGD and the problem remains.

There is a rather long-term plan outlined in issue #15359: we would first flag non-deterministic behaviour, and then one could switch to deterministic implementations as desired.
It is, however, hard work, and it seems to be one of those things everyone wants to have and no-one wants to do.

Best regards

Thomas


Hi Tom,
Thank you for your reply.

I read your answer to this issue regarding non-deterministic behavior during backward(), which is exactly where I have located the issue at the moment. All values remain unchanged in the forward propagation, and the gradient differences appear in both the encoder and the decoder.

What I don’t understand, though, is why my test machine with a GTX 980 Ti shows differences of 1e-10 or even 1e-12 in the gradients. Should I trust such minimal differences? Certainly I can’t ignore them, as the differences in the weights grow bigger later.
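For what it’s worth, this is roughly how I measure the gradient differences between two seeded runs; the single conv layer here is only a stand-in for my encoder/decoder.

import torch
import torch.nn as nn

def one_step_grads(device):
    # One seeded forward/backward pass; returns a copy of all gradients.
    torch.manual_seed(123)
    model = nn.Conv2d(3, 8, 3, padding=1).to(device)
    x = torch.randn(2, 3, 16, 16, device=device)
    model(x).pow(2).mean().backward()
    return [p.grad.clone() for p in model.parameters()]

device = "cuda" if torch.cuda.is_available() else "cpu"
for ga, gb in zip(one_step_grads(device), one_step_grads(device)):
    print("max abs gradient difference:", (ga - gb).abs().max().item())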

A rookie question: I reproduced example code from PyTorch to make sure nothing else caused the problem, but I found that the encoder and decoder pair from here caused the most trouble after backward().

I applied all the available deterministic methods mentioned in the post. By reducing the conv and residual layers in the code I mentioned above, I saw fewer occurrences of divergence, but most of the gradient differences are at the level of 1e-4 or 1e-3. Isn’t that too big to ignore?

But thank you for your feedback and great work.

Best regards,
CY

Well, ideally your learning procedure should be robust in the sense that you’d get comparable results from two runs despite the numerical differences.

If you want to have reproducible learning, you need to avoid (or fix!) those functions that introduce the non-determinism. The documentation has a best-knowledge list (I think a particular 2d cross-entropy function that is non-deterministic in the forward pass is missing from it; there is a bug report about that).
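As a side note, newer PyTorch releases expose torch.use_deterministic_algorithms, which raises an error whenever an operation without a deterministic implementation is hit, so you don’t have to hunt for them by hand. A minimal sketch, assuming a version that has it:

import torch

# Ask PyTorch to error out on operations that have no deterministic implementation.
# Available in newer releases (around 1.8 and later).
torch.use_deterministic_algorithms(True)

# Some CUDA operations additionally need CUBLAS_WORKSPACE_CONFIG=:4096:8
# (or :16:8) set in the environment before the process starts.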

If you have one particular function that you want to fix and are willing to dig into it, that would certainly be interesting, too. For example, Chang Rajani is looking at improving the performance and eliminating non-determinism from torch.fold in #17501. Most of the time, such a function uses the CUDA function atomicAdd, which allows multiple threads to add to the same memory location. As the order of the additions isn’t deterministic, the floating-point results differ between runs, and the differences blow up over training. There are relatively standard techniques to avoid the atomic add (e.g. changing the order of processing things so that only one thread writes to any given gradient value), but it does take a bit of work.
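To make the atomicAdd point concrete, here is a small sketch (it needs a GPU to show the effect): index_add_ with many repeated indices accumulates with atomic adds on CUDA, so two runs over the same data can differ in the last bits.

import torch

# Needs a CUDA device: scatter-style ops accumulate with atomicAdd there.
device = "cuda"
src = torch.randn(100000, device=device)
index = torch.randint(0, 10, (100000,), device=device)

out_a = torch.zeros(10, device=device).index_add_(0, index, src)
out_b = torch.zeros(10, device=device).index_add_(0, index, src)

# Mathematically the two results are identical; in floating point they can
# differ because the order of the additions is not fixed.
print((out_a - out_b).abs().max().item())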

Best regards

Thomas


Hi Tom, I’m playing around to see how the combination, number, and choice of conv layers and residual blocks affect the results, and whether I can find a compromise that works as a solution for my current network.

Thank you!

Best,
CY