I have a large model which involves sigmoid and logsigmoid functions. I am using float64 for better accuracy with torch.set_default_dtype(torch.float64).
Running the same code twice, I got different training results. I printed the loss to see what happened and found that the losses are identical for the first few hundred iterations. Then tiny differences appear (after the 10th decimal place). Eventually, the difference grows so large that the training is not reproducible.
My code does involve random number generation. However, I don’t think that is the problem, since I use a random generator with a fixed seed.
Hence, I think the reason is the accumulation of precision loss. How can I control the precision loss? In general, is it true that the same code will produce different results even if there is no randomness involved?
What confuses me is: shouldn’t the precision loss be deterministic (on the same machine), and thus lead to the same result?
Yes, the code can produce small differences due to the limited floating point precision, since floating point operations are not associative. The algorithm being used might also be non-deterministic (you can enable deterministic algorithms on the GPU at a potential performance penalty), which could likewise cause these mismatches. Note that both results are still valid, and your initial one is not closer to the theoretical output than the next run.
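For reference, opting into deterministic behavior looks roughly like this (a minimal sketch; see the PyTorch reproducibility notes for the full list of caveats):

```python
import torch

# Seed the RNGs and request deterministic kernels. Ops without a deterministic
# implementation will raise an error instead of silently running non-deterministically.
torch.manual_seed(0)
torch.use_deterministic_algorithms(True)

# cuDNN autotuning can pick different kernels between runs, so disable it.
torch.backends.cudnn.benchmark = False

# For CUDA >= 10.2, deterministic cuBLAS additionally needs the environment variable
# CUBLAS_WORKSPACE_CONFIG=:4096:8 (or :16:8) set before the process starts.
```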
The structure of the network I was using is a simple but quite deep one. It contains mostly linear layers whose weight matrices are masked by a buffered binary matrix. It seems like none of the operations could be random.
Well, MaxPooling is a simple operation that is random on the GPU. So maybe the linear layer is random as well? Is there a list showing all the random operations on the GPU?
What about on the CPU? Will the precision loss accumulate and lead to different results (on the same machine but at a different time)? I am trying to figure out the reason and avoid it. If the irreproducibility is due to accumulated precision loss, I guess there is no way to get around it?
Again, non-deterministic results are not necessarily caused by any random operation, but arise because the algorithmic implementation accumulates floating point values in different orders.
A simple example (a sketch of the kind of snippet that shows the effect of summation order; the exact size of the mismatch will vary) is:
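```python
import torch

x = torch.randn(100, 100, 100)

# Mathematically these are the same sum, but the values are accumulated
# in a different order, so the floating point results can differ slightly.
s1 = x.sum()
s2 = x.sum(0).sum(0).sum(0)
print((s1 - s2).item())  # typically a tiny non-zero value
```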
Yes, that is what I meant. My bad for using the term “random operation” when I meant a non-deterministic implementation.
Back to my original question: suppose we control all operations so that only deterministic implementations are used. Will the precision loss then vary and accumulate into different results on the same machine across runs?
For example, say a = f(...) and the theoretical result for a is 1/3. Using the same code on the same machine, will we sometimes get a = 0.33···111 and sometimes a = 0.33···222 (the differing digits being due to precision loss), so that further computations carry this difference and accumulate it into significantly different results?
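To make the toy example concrete (a pure-Python sketch; the digits shown are just what IEEE-754 doubles give for 1/3):

```python
a = 1.0 / 3.0
# The exact value 1/3 cannot be represented in binary floating point,
# so the stored value already carries a small rounding error.
print(f"{a:.20f}")  # 0.33333333333333331483
```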
Based on my limited tests, if only deterministic implementations are used, the results are the same across multiple executions; that is, the precision loss is deterministic and does not lead to different results across runs (on the same machine). To use my toy example above, if a = f(...) results in a = 0.33···222, then it will always be a = 0.33···222.
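A toy check along the lines of what I ran (a sketch; it assumes the same machine, library versions, and thread settings between runs):

```python
import torch

torch.set_default_dtype(torch.float64)

def run_once(seed=0):
    # Fixed seed and deterministic CPU ops: the whole computation should be repeatable.
    torch.manual_seed(seed)
    x = torch.randn(500, 500)
    w = torch.randn(500, 500)
    return torch.nn.functional.logsigmoid(x @ w).sum()

a = run_once()
b = run_once()
print(a.item(), b.item(), (a == b).item())  # expected: identical values and True
```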
Note that this conclusion is based on my limited tests, so it could be wrong in general. Understanding the behavior for sure requires more knowledge of floating point numbers and hardware, which is out of the scope of this topic.