Does DCG compute asynchronously?

Hello, I profiled my code using cProfile, and the result looked odd: one operation (sum) took most of the execution time. When I added time.sleep(0.02) before the operation, the time it took decreased. I suspect that is because the operation waits for all pending computations to finish. Is this one of the properties of DCG?

Am I right in thinking the sum was a full reduction? If so, it's because it implicitly copies the result to the host, which causes a sync point.

By the way, you are right that GPU operations are async by default, in the absence of any kind of sync point, such as reading data back to the host.
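
The profiling effect described above can be reproduced without a GPU. The sketch below is only an analogy, not how torch is implemented: a single worker thread stands in for the GPU's work queue, `kernel` and `run` are hypothetical names, and the durations are made up. It shows that the launch returns immediately, that the host-side read absorbs the pending work, and that sleeping on the host first makes the read look cheap.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Analogy only: one worker thread plays the role of the GPU's in-order work queue.
device = ThreadPoolExecutor(max_workers=1)

def kernel(duration, value):
    """Hypothetical device work: takes `duration` seconds, then yields `value`."""
    time.sleep(duration)
    return value

def run(host_sleep):
    """Launch work asynchronously, optionally sleep on the host, then read the result."""
    t0 = time.perf_counter()
    pending = device.submit(kernel, 0.2, 42.0)  # returns at once, like an async launch
    launch_time = time.perf_counter() - t0

    time.sleep(host_sleep)  # plays the role of the time.sleep(...) in the question

    t1 = time.perf_counter()
    result = pending.result()  # host-side read: blocks until the work has finished
    read_time = time.perf_counter() - t1
    return launch_time, read_time, result

launch_a, read_a, value_a = run(host_sleep=0.0)
launch_b, read_b, value_b = run(host_sleep=0.5)
print(f"no sleep:   launch {launch_a:.3f}s, read {read_a:.3f}s")
print(f"with sleep: launch {launch_b:.3f}s, read {read_b:.3f}s")
device.shutdown()
```

Without the sleep, nearly all the time lands on the read, which is what the profiler reported for sum. In real PyTorch code, calling torch.cuda.synchronize() just before starting and stopping a timer is the usual way to attribute GPU time to the right line.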

Yes, sum is a full reduction. Thanks for the clear answer.

Then are CPU operations synchronous?

EDIT: Why does a full reduction copy the result to the host?


The CPU operations are synchronous. Only the GPU operations are asynchronous.

The full reduction returns a number, and to be able to return this number, it has to wait for the computation to be done.

There are a couple of possible ‘why?’ questions here:

  • what is the technical underlying reason?
  • why is it like this?

The underlying technical reason is that anything that causes a ‘read’ of an actual concrete value from the GPU causes a sync point. Operations that return torch tensors don't necessarily force sync points. However, reduce all, in its current implementation, returns a scalar float rather than a tensor, and this forces a sync point.
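
To make the tensor-versus-scalar distinction concrete, here is a small sketch. It is an analogy rather than torch code: a single worker thread stands in for the GPU queue, and `slow_sum`, `sum_as_handle` and `sum_as_float` are hypothetical names invented for illustration. Returning a handle lets the caller carry on, while returning a plain float means the call cannot come back until the value exists.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Analogy only: one worker thread stands in for the GPU's work queue.
device = ThreadPoolExecutor(max_workers=1)

def slow_sum(values):
    """Hypothetical device-side reduction that takes a while to finish."""
    time.sleep(0.2)
    return float(sum(values))

def sum_as_handle(values):
    """Like an op that returns a tensor: hands back a handle and does not wait."""
    return device.submit(slow_sum, values)

def sum_as_float(values):
    """Like a reduce all that returns a Python float: has to wait for the value."""
    return device.submit(slow_sum, values).result()

data = [1.0, 2.0, 3.0]

t0 = time.perf_counter()
handle = sum_as_handle(data)  # comes back immediately, work still pending
handle_time = time.perf_counter() - t0

t0 = time.perf_counter()
total = sum_as_float(data)    # blocks, and also waits for the work queued before it
float_time = time.perf_counter() - t0

print(f"handle returned in {handle_time:.3f}s, float returned in {float_time:.3f}s")
print(total, handle.result())
device.shutdown()
```

Note that the float-returning call also pays for the work queued ahead of it, which matches the observation that sum appeared to wait for all pending computations. In more recent PyTorch releases, sum() returns a zero-dimensional tensor, and the sync point moves to wherever the value is actually read on the host, for example via .item().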

Why does reduce all return a scalar rather than a tensor? I don't actually know :) but I guess it's some combination of:

  • maybe torch was written before GPUs were widely available, and on the CPU, making reduce all return a float seems not unreasonable?
  • torch is written by conv net guys, and for conv nets, reduce all causing a sync point is almost unnoticeable in practice

That’s exactly what I wondered. I didn’t even know sum returns a float.