Hello, I profiled my code using cProfile, but the result seemed weird. One operation (sum) took most of the execution time. And when I added time.sleep(0.02) before the operation, the time taken by the operation decreased. I suspect that is because the operation waits for all pending computations to finish. Is this one of the DCG properties?
Am I right in thinking the sum was a full reduction? It's because it implicitly copies the result hostside, causing a sync point.
By the way, you are right that GPU operations are async by default, in the absence of any kind of sync point, such as reading data to hostside.
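The effect described above (the blocking read absorbs the pending work, and a host-side sleep before it makes the read look cheaper) can be reproduced without a GPU. The following is only a rough sketch using the Python standard library, not torch itself: the single-worker thread pool `device`, `fake_kernel`, and `timed_read` are hypothetical stand-ins for an in-order GPU queue, a kernel, and a profiled read.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a GPU queue: one worker that runs submitted work in order.
device = ThreadPoolExecutor(max_workers=1)

def fake_kernel(duration):
    """Pretend to be a GPU kernel that takes `duration` seconds."""
    time.sleep(duration)
    return 42.0

def timed_read(sleep_before):
    """Launch a kernel, optionally sleep on the host, then time the blocking read."""
    pending = device.submit(fake_kernel, 0.2)  # returns immediately, like an async launch
    time.sleep(sleep_before)                   # host sleeps while the "kernel" keeps running
    start = time.perf_counter()
    value = pending.result()                   # sync point: blocks until the kernel is done
    return value, time.perf_counter() - start

value_a, wait_without_sleep = timed_read(0.0)
value_b, wait_with_sleep = timed_read(0.15)
print(f"read blocked for {wait_without_sleep:.3f}s without the sleep")
print(f"read blocked for {wait_with_sleep:.3f}s with the sleep")
device.shutdown()
```

The total work is the same in both calls. The sleep only moves part of the waiting out of the read, which is why a profiler attributes less time to it.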
Yes, sum is a full reduction. Thanks for the clear answer.
Then are CPU operations synchronous?
EDIT: Why does full reduction copy the result to the host?
The CPU operations are synchronous. Only the GPU operations are asynchronous.
The full reduction returns a number, and to be able to return this number, it has to wait for the computation to be done.
There are a few possible ‘why?’:
- what is the technical underlying reason?
- why is it like this?
The underlying technical reason is that anything that causes a ‘read’ of an actual concrete value from the GPU causes a sync point. Operations returning torch tensors don't necessarily force sync points. However, reduce all, in its current implementation, returns a scalar float rather than a tensor. This forces a sync point.
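To make the tensor-versus-scalar distinction concrete, here is a small standard-library sketch. It is an analogy, not torch's implementation: `LazyTensor`, `slow_sum`, `sum_as_tensor`, and `sum_as_scalar` are hypothetical names, and the thread pool `work_queue` stands in for the GPU's pending work.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-order device queue.
work_queue = ThreadPoolExecutor(max_workers=1)

class LazyTensor:
    """Hypothetical handle to a value the device may not have computed yet."""
    def __init__(self, future):
        self._future = future

    def item(self):
        # Reading the concrete value is the sync point.
        return self._future.result()

def slow_sum(values):
    time.sleep(0.1)  # stand-in for pending device work
    return float(sum(values))

def sum_as_tensor(values):
    """Return a handle immediately; nothing has to be waited on."""
    return LazyTensor(work_queue.submit(slow_sum, values))

def sum_as_scalar(values):
    """Return a plain float; this has to wait for the result."""
    return sum_as_tensor(values).item()

start = time.perf_counter()
handle = sum_as_tensor([1, 2, 3])
launch_time = time.perf_counter() - start

start = time.perf_counter()
total = sum_as_scalar([1, 2, 3])
scalar_time = time.perf_counter() - start

print(f"handle returned after {launch_time:.4f}s, scalar after {scalar_time:.4f}s")
work_queue.shutdown()
```

In this sketch the scalar version also waits for the earlier queued sum, much like a host read waiting for all previously launched work.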
Why does reduce all return a scalar rather than a tensor? I don't actually know, but I guess some combination of:
- maybe torch was written before GPUs were widely available, and on the CPU, making reduce all return a float seems not unreasonable?
- torch is written by conv net guys, and for conv nets, reduce all causing a sync point is almost unnoticeable in practice
That’s exactly what I wondered. I didn’t even know sum returns float.