In your reduction operations, mainly torch.sum() / x.sum() (where x is a tensor) and similar sum reductions, you are not implementing Kahan (compensated) summation; the accumulator just returns a + b. Small contributions get rounded away, leading to vanishing-gradient problems.
Let's say we are in fp32, in the binade starting at 2^24 = 16777216, where the gap between two adjacent representable numbers is 2. If I do 16777216 + 1, it returns 16777216 again. If I do 16777216 + 1 + 1, it should return 16777218, since 16777218 is representable in fp32 (16777217 is not), but naive left-to-right a + b accumulation still returns 16777216.
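A minimal sketch reproducing this in PyTorch (just an illustration, using explicit float32 tensors):

```python
import torch

x = torch.tensor(16777216.0, dtype=torch.float32)
one = torch.tensor(1.0, dtype=torch.float32)

print((x + one).item())          # 16777216.0 -- 16777217 is not representable in fp32
print(((x + one) + one).item())  # 16777216.0 -- naive left-to-right accumulation loses both 1s
print((x + (one + one)).item())  # 16777218.0 -- the exact sum is representable if the small terms combine first
```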
Where it matters:
This might pose a big problem for vanishing gradients:
Let's say we sum 100 + 0.0001 + 0.00001 + 0.001 + …, as in batch normalization or loss calculations where we use reduce_sum; those small gradient contributions get partially or entirely rounded away and effectively vanish.
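To make this concrete, here is a small illustrative comparison (the values 100.0 and 1e-6 are my own, chosen so the effect is total, not taken from any benchmark). A strictly sequential a = a + b accumulation in fp32 drops every tiny term, while an fp64 reference keeps them; what torch.sum itself returns depends on the accumulation order of the backend:

```python
import torch

# One large value followed by many tiny contributions, as in a loss/gradient reduction.
large = torch.tensor([100.0], dtype=torch.float32)
small = torch.full((100_000,), 1e-6, dtype=torch.float32)  # the tail sums to ~0.1
t = torch.cat([large, small])

# Strictly sequential a = a + b in fp32: each 1e-6 is below half an ulp of 100.0
# (~3.8e-6), so every single addition rounds back to the running total.
acc = torch.tensor(0.0, dtype=torch.float32)
for v in t:
    acc = acc + v

print(acc.item())              # 100.0 -- the whole ~0.1 tail has vanished
print(t.sum().item())          # whatever the backend's actual accumulation order yields
print(t.double().sum().item()) # ~100.1, fp64 reference
```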
Does this affect speed? And what about the extra computation?
Let's reason about it theoretically:
Kahan summation requires roughly 4x more arithmetic ops per element than a naive sum that just returns a + b (see the sketch at the end of this section), but reductions are memory-bound anyway: for most reduction ops the processor is literally waiting for data to arrive from RAM.
For small tensors the extra work and its vectorization may cost time, but for deep-learning applications with large tensors, the memory bottleneck hides these arithmetic costs.
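For reference, here is a minimal sketch of Kahan (compensated) summation written as a plain Python loop over a PyTorch tensor. This is my own illustration of the classic algorithm, not how torch.sum is implemented; the roughly 4 floating-point ops per element are visible in the loop body:

```python
import torch

def kahan_sum(t: torch.Tensor) -> torch.Tensor:
    """Compensated (Kahan) summation over a 1-D tensor, written as a plain loop for clarity."""
    s = torch.tensor(0.0, dtype=t.dtype)  # running sum
    c = torch.tensor(0.0, dtype=t.dtype)  # compensation: low-order bits lost so far
    for v in t:
        y = v - c            # re-inject the error lost on previous steps
        new_s = s + y        # big + small: low-order bits of y can be rounded away here
        c = (new_s - s) - y  # algebraically recover what was just rounded away
        s = new_s
    return s

# Same data as the earlier example: 100.0 followed by 100,000 values of 1e-6, all in fp32.
large = torch.tensor([100.0], dtype=torch.float32)
small = torch.full((100_000,), 1e-6, dtype=torch.float32)
print(kahan_sum(torch.cat([large, small])).item())  # ~100.1 even though everything stays in fp32
```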
