There are a few possible senses of 'why?':
- what is the underlying technical reason?
- why was it designed this way?
The underlying technical reason is that anything that reads an actual concrete value back from the GPU forces a synchronization point. Operations that return torch tensors don't necessarily force sync points, since kernel launches are asynchronous and the result can stay on the device. However, reduce-all, in its current implementation, returns a scalar Python float rather than a tensor, so the value has to be copied back to the host, and that copy forces a sync point.
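
Here's a minimal sketch that makes the difference visible (assumes a CUDA device; in current torch, `.item()` is what pulls a Python float back, so I'm using it to stand in for a reduce-all that returns a float; exact numbers will vary):

```python
import time
import torch

x = torch.randn(4096, 4096, device="cuda")
torch.cuda.synchronize()  # start from an idle GPU so the timings are clean

# Kernel launches are asynchronous: this loop returns long before the
# GPU has actually finished the matmuls, it just queues them.
t0 = time.time()
for _ in range(10):
    y = x @ x
print(f"queueing the work: {time.time() - t0:.4f}s")

# Reading a concrete value back on the host (here via .item(); a
# reduce-all that returns a Python float has the same effect) cannot
# return until every queued kernel has completed, so it blocks.
t0 = time.time()
total = y.sum().item()
print(f"reading a value:   {time.time() - t0:.4f}s")
```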
Why does reduce-all return a scalar rather than a tensor? I don't actually know, but I'd guess some combination of:
- maybe torch was written before GPUs were widely available, and on the CPU, having reduce-all return a float seems not unreasonable?
- torch was written by conv-net people, and for conv nets, the sync point caused by reduce-all is almost unnoticeable in practice