Possible solution to the reproducibility issues of the scatter_add operation

As is well known, some PyTorch operations such as scatter_add are non-deterministic, and it can be annoying to get different results from different runs with the same hyperparameters. Recently I found this article, which describes a methodology for obtaining deterministic behaviour in floating-point summation. I’m wondering whether it would be useful to implement this in PyTorch. What do you think?
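
To make the problem concrete, here is a small, hedged repro sketch (the tensor sizes and number of bins are arbitrary, and a CUDA device is assumed): with colliding indices, the order of the floating-point additions on the GPU is not fixed, so identical inputs can produce slightly different sums across runs.

```python
import torch

# Illustrative sizes only; a CUDA device is assumed.
torch.manual_seed(0)
src = torch.randn(100_000, device="cuda")
index = torch.randint(0, 10, (100_000,), device="cuda")  # many collisions per bin

results = []
for _ in range(5):
    out = torch.zeros(10, device="cuda")
    out.scatter_add_(0, index, src)  # atomic adds -> summation order varies per run
    results.append(out)

# With identical inputs, the run-to-run difference is often non-zero on CUDA.
max_diff = max((results[0] - r).abs().max().item() for r in results[1:])
print(f"max run-to-run difference: {max_diff:.3e}")
```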

I don’t think it’s the algorithms as much as the lack of people doing it.
For example, for many operations (upscaling comes up frequently) it would be relatively easy to make the CUDA backward deterministic by having each grad_in location treated by exactly one thread.
For others that use an index_add/scatter_add-style operation, the Embedding backward may serve as a blueprint.
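
To illustrate the idea in plain PyTorch (the helper name and the 1-D restriction are my own, and this is only a sketch of the principle rather than how a real kernel is written): instead of many writers scattering into the same location with atomic adds, each output location gathers and reduces its own contributions in a fixed order, which is what “one thread per grad_in location” buys you in CUDA.

```python
import torch

def gather_style_scatter_add(src: torch.Tensor, index: torch.Tensor, num_bins: int) -> torch.Tensor:
    """Hypothetical 1-D helper: each output bin gathers and sums its own
    contributions in a fixed order, instead of many threads racing to add
    into the same bin atomically. Same reduction, gather formulation,
    therefore run-to-run deterministic."""
    out = src.new_zeros(num_bins)
    for i in range(num_bins):
        # All contributions to bin i are reduced by "one worker", in input order.
        out[i] = src[index == i].sum()
    return out
```

The Python loop is of course slow; the only point is that the per-location reduction order is fixed, so repeated calls give bit-identical results.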

Best regards

Thomas

Hi @tom, thanks for your answer.

I don’t think it’s the algorithms as much as the lack of people doing it.
For example, for many operations (upscaling comes up frequently) it would be relatively easy to make the CUDA backward deterministic by having each grad_in location treated by exactly one thread.

I see what you mean. It is not a problem of knowledge, it is more a matter of human resources, right? However, in my opinion, having a deterministic mode for all the current algorithms should be a priority. I know there is a general perception that non-deterministic algorithms do not influence the results much, but in some cases the impact is huge. For instance, I started this thread because some of the operations in pytorch_geometric are built on non-deterministic PyTorch implementations. As you can see in this GitHub issue, this behavior is affecting the results. In the end, this non-determinism propagates to downstream libraries like pytorch_geometric and causes a lack of reproducibility.

For others that use an index_add/scatter_add-style operation, the Embedding backward may serve as a blueprint.

Can you give me more details about this? I am not fluent in CUDA, but with a blueprint I can try to collaborate on a deterministic mode for scatter_add/index_add. I think a deterministic mode is a must-have.

In aten/src/ATen/native/cuda/Embedding.cu, embedding_dense_backward_cuda essentially computes an index_add: the gradient for embedding matrix row i is the sum of all incoming gradient elements for which i was embedded in the forward pass. (Disregard scale_grad_by_freq and only consider the false case.)
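
In Python terms (this is a paraphrase of what the kernel computes, not the actual implementation; the helper name and the flattening of the inputs are assumptions), that is:

```python
import torch

def embedding_dense_backward_reference(grad_output: torch.Tensor,
                                       indices: torch.Tensor,
                                       num_weights: int) -> torch.Tensor:
    """Reference sketch: grad_weight[i] is the sum of all grad_output rows
    whose forward lookup used index i, i.e. an index_add along dim 0."""
    flat_idx = indices.reshape(-1)
    flat_grad = grad_output.reshape(-1, grad_output.size(-1))
    grad_weight = grad_output.new_zeros(num_weights, grad_output.size(-1))
    grad_weight.index_add_(0, flat_idx, flat_grad)
    return grad_weight
```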

Two variants are implemented:

  • For small batches (< 768 elements embedded in total) a direct variant is implemented in embedding_backward_feature_kernel. That is somewhat elaborate CUDA, with explicit communication amongst the threads to ensure determinism. If you haven’t done much CUDA programming, this will be an opportunity to learn a lot, but at fewer than 100 lines of code with ample comments it should be OK.
  • For large batches one falls back to sorting the indices (this is still done in embedding_dense_backward_cuda) and then hands off to embedding_backward_cuda_kernel in EmbeddingBackwardKernel.cu. Here again, you can assume offset2bag to be undefined. It does quite a few steps to partition the work and then execute it, but it is implemented as Thrust calls and a few fairly simple kernels; a rough Python sketch of the sort-then-reduce idea follows after this list.
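
For the second variant, here is a hedged Python sketch of the sort-then-reduce idea (the function name is mine, only the flat-index/2-D-gradient case is handled, and a real kernel would use segmented reductions instead of a Python loop): a stable sort makes equal indices contiguous, and each contiguous segment is then reduced in a fixed order, so the result no longer depends on the order of atomic additions.

```python
import torch

def sorted_segment_embedding_backward(grad_output: torch.Tensor,
                                      indices: torch.Tensor,
                                      num_weights: int) -> torch.Tensor:
    """Sketch: make equal indices contiguous with a stable sort, then sum
    each segment in a fixed order, so every run produces the same result."""
    flat_idx = indices.reshape(-1)
    flat_grad = grad_output.reshape(-1, grad_output.size(-1))

    # 1. Stable sort so equal indices form contiguous segments
    #    (ties keep their original order).
    sorted_idx, perm = flat_idx.sort(stable=True)
    sorted_grad = flat_grad[perm]

    # 2. Reduce each segment in order. A real kernel would launch
    #    segmented reductions here instead of looping in Python.
    grad_weight = grad_output.new_zeros(num_weights, grad_output.size(-1))
    unique_idx, counts = torch.unique_consecutive(sorted_idx, return_counts=True)
    start = 0
    for i, count in zip(unique_idx.tolist(), counts.tolist()):
        grad_weight[i] = sorted_grad[start:start + count].sum(dim=0)
        start += count
    return grad_weight
```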

So it takes quite a bit of work to do this properly, but the ingredients are all there.

Best regards

Thomas

@amosella Would you like to post this on https://github.com/pytorch/pytorch/issues, since this is not only related to libtorch? We would be able to prioritize general feature requests there based on community needs and available resources.

@yf225 Thanks for your suggestion. I created this issue.