How does PyTorch compute on the GPU?

Hi, I have been writing a deep learning library. I think a lot about how GPU and CPU computation should be organized, and I wonder how PyTorch actually computes things and manages to be such a fast library. Before going further, I want to say that I tried to find the answer by reading the CNTK source code. However, it's so complicated that I couldn't understand it at all. After the CNTK experience, I gave up on digging through the PyTorch source as well.

My library has Layers, which all have the sequence length as their outer dimension, like below. Each box of a layer is called a Term.
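Roughly, the structure is something like this (just a placeholder sketch, the names don't match my real code):

```python
import torch

# Rough placeholder for my structure (names are made up): a Layer holds one
# Term per sequence step, so the sequence length is the outer dimension.
class Layer:
    def __init__(self, seq_len, hidden_size, device="cpu"):
        # terms[t] is the "box" at sequence position t
        self.terms = [torch.zeros(hidden_size, device=device) for _ in range(seq_len)]

layer = Layer(seq_len=10, hidden_size=256)
```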

One option is to keep a single main thread on the CPU and parallelize the computation of each Term on the GPU. This way, I only compute one Term at a time. However, a single Term may not be enough work to fill the GPU, so GPU utilization could end up very low.
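Here is a rough sketch of what I mean, using plain PyTorch ops as stand-ins for my own kernels (compute_term and the shapes are made up):

```python
import torch

# Illustrative only: compute_term() stands in for whatever ops produce one Term,
# and the shapes are made up. Assumes a CUDA device is available.
def compute_term(x, weight):
    return torch.tanh(x @ weight)

device = torch.device("cuda")
weight = torch.randn(256, 256, device=device)
terms_in = [torch.randn(32, 256, device=device) for _ in range(100)]

# Approach 1: one main CPU thread, one Term at a time on the GPU.
# Each launch may only occupy a fraction of the GPU, so utilization stays low
# when the per-Term work is small.
outputs = [compute_term(x, weight) for x in terms_in]
torch.cuda.synchronize()
```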

The second approach I thought of is: whenever I am about to compute a Term, I first compute all the Terms it depends on simultaneously. Once those prerequisite Terms are done, I use them to compute my Term. This way, I can push more than one Term to the GPU at a time. But there might be a problem here as well that I am not sure about.
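And a sketch of that second idea, using torch.cuda.Stream to launch the independent prerequisite Terms so the GPU has a chance to overlap them (again, the names and shapes are made up, and I'm not sure how much overlap actually happens in practice):

```python
import torch

# Illustrative only: three independent prerequisite Terms feed one dependent Term.
# Assumes a CUDA device; whether the kernels really overlap depends on how much
# of the GPU each one occupies.
device = torch.device("cuda")
weight = torch.randn(256, 256, device=device)
inputs = [torch.randn(32, 256, device=device) for _ in range(3)]

streams = [torch.cuda.Stream() for _ in inputs]
prereqs = [None] * len(inputs)

# Launch the independent prerequisite Terms on separate streams so the GPU
# is free to run them concurrently.
for i, (x, s) in enumerate(zip(inputs, streams)):
    with torch.cuda.stream(s):
        prereqs[i] = torch.tanh(x @ weight)

# Make the default stream wait for every prerequisite stream before the
# dependent Term reads their results.
for s in streams:
    torch.cuda.current_stream().wait_stream(s)

dependent_term = torch.tanh(sum(prereqs) @ weight)
torch.cuda.synchronize()
```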

I wonder how PyTorch handles this. Could someone give me a rough idea?

Thank you