reduce_sum() / sum (reduction) on floating-point values is not accurate: it does not implement the Kahan summation algorithm and just returns a+b, leading to the vanishing gradient problem

Your reduction operations, mainly torch.sum() or x.sum() (where x is a tensor), do not implement Kahan summation. They just return a+b, which leads to vanishing gradient problems.
Let's take fp32. In the binade starting at 2^24 = 16777216, the gap between two adjacent representable numbers is 2. So 16777216+1 returns 16777216 again. But 16777216+1+1 should return 16777218, because 16777218 can be represented in fp32 even though 16777217 can't. With plain left-to-right addition, each +1 is rounded away on its own and the result stays at 16777216.
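The rounding behaviour described above can be checked without any tensor library. The following is a minimal sketch that emulates fp32 addition in plain Python; the helper names `f32` and `add32` are made up for this illustration and are not part of PyTorch.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def add32(a: float, b: float) -> float:
    """Emulate one fp32 addition: add in fp64, then round the result to fp32."""
    return f32(f32(a) + f32(b))

big = 16777216.0  # 2**24, where the fp32 spacing becomes 2

one_step = add32(big, 1.0)               # 16777217 is not representable
two_steps = add32(one_step, 1.0)         # the second +1 is lost the same way
regrouped = add32(big, add32(1.0, 1.0))  # adding 2 at once is exact

print(one_step)   # 16777216.0
print(two_steps)  # 16777216.0
print(regrouped)  # 16777218.0
```

The only difference between `two_steps` and `regrouped` is the order of the additions, which is why the summation strategy matters.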
Where it matters:
This might pose a big problem for vanishing gradients.
Take a sum like 100+0.0001+0.00001+0.001+…, as in batch normalisation or loss calculations where we use reduce_sum: those small gradient contributions can be rounded away.

Does this affect the speed? And what about computation?
Let's go through it theoretically:
Kahan summation requires about 4x more arithmetic ops than the naive sum, which just returns a+b. But reductions are memory-bound too (for most reduction ops the processor is literally waiting to get data from RAM).
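For reference, here is a minimal sketch of Kahan (compensated) summation next to a naive running sum, both emulated in fp32 with plain Python. The function names are hypothetical and this is not how torch.sum is implemented. The comments mark the four additions/subtractions per element behind the "4x" figure.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def naive_sum32(values):
    """Left-to-right fp32 sum: one addition per element."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total

def kahan_sum32(values):
    """Kahan summation in fp32: four additions/subtractions per element."""
    total = 0.0
    comp = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = f32(f32(v) - comp)          # op 1: apply the pending correction
        t = f32(total + y)              # op 2: the (rounded) running sum
        comp = f32(f32(t - total) - y)  # ops 3 and 4: what op 2 rounded away
        total = t
    return total

values = [16777216.0] + [1.0] * 1000

print(naive_sum32(values))  # 16777216.0 (every +1 is lost)
print(kahan_sum32(values))  # 16778216.0 (the exact sum)
```

The compensation term `comp` carries the part of each addend that did not fit into `total`, so it gets another chance on the next iteration.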

For small tensors, vectorization may cost time, but for deep learning applications with large tensors, the memory bottleneck outweighs these arithmetic costs.


A few thoughts on this:

  1. Gradients (especially the large sums happening within matrix multiplication) are typically not all non-negative, unlike the values in your example. If you have a sum with random positive and negative numbers of the same approximate magnitude, Kahan summation will not significantly improve things. I realize that you are specifically mentioning torch.sum(), but most models don’t perform large reductions in the gradient calculation, because this would mean a large broadcast during forward propagation. Can you provide a concrete example of a model where this is happening?
  2. Most summation for gradients on most hardware (multi-core CPUs or GPUs) is already grouped: e.g. on a GPU, torch.sum would be implemented such that each thread only performs very few sequential additions, with most of the summation happening in a hierarchical/tree-like way. This is not the same as Kahan summation, but it has better numerical error than naive approaches.
  3. You are right that for something like torch.sum, on most hardware, Kahan summation should not significantly impact performance, but it does require a bit more energy, and the benefits aren’t obvious given that there are many techniques available to ensure that gradients are distributed around 0 (e.g. follow a normal distribution), which address the vanishing gradient problem already. This is likely why there hasn’t been an ask for this before. I think you could open a GitHub issue to request this feature if you believe that there are important use cases benefitting from it.
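To illustrate point 2, here is a minimal sketch of a tree-like (pairwise) reduction compared with a naive left-to-right sum, again emulated in fp32 with plain Python. The function names are hypothetical, and the recursion only mimics the hierarchical grouping; it is not the actual torch.sum kernel.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def naive_sum32(values):
    """Left-to-right fp32 sum."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total

def pairwise_sum32(values):
    """Tree-like fp32 sum: split in half, sum each half, add the two results."""
    n = len(values)
    if n == 1:
        return f32(values[0])
    mid = n // 2
    return f32(pairwise_sum32(values[:mid]) + pairwise_sum32(values[mid:]))

# 2**24 followed by 1023 ones; the exact sum is 16778239.
values = [16777216.0] + [1.0] * 1023

print(naive_sum32(values))     # 16777216.0 (all 1023 ones are lost)
print(pairwise_sum32(values))  # 16778238.0 (only a single 1 is lost)
```

Because the small values are first combined with each other, most of them reach the large value as partial sums that are big enough to survive rounding.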