Questions about how a computational graph is executed

Hi there,

I am new to computational graphs in PyTorch. I am reading the blog post How Computational Graphs are Executed in PyTorch and have several questions about it.

My first question is about why we need threads, from the section GRAPH TRAVERSAL AND NODE EXECUTION. According to the article, the engine initializes one thread per device, in addition to the main thread that executes the Python interpreter. Does this mean that a thread exchanges data with other threads while computing gradients, i.e., while executing a computational graph on that thread? Or is there no data exchange? For example, if we have one CPU and one GPU and never move data/tensors to the GPU, will the GPU thread stay idle while the main thread on the CPU is responsible for executing the computational graph(s)?

The second question is about the ReadyQueue. As far as I know, each thread has a thread-local ready queue and starts executing a computational graph when the main thread enqueues a RootNode/task into that queue; the thread then computes gradients as shown in Figure 5, Animation of the execution of the computational graph. However, the blog also mentions that

Also, if the rest of the graph is not on the cpu, we directly start on that worker while the RootNode is always placed on the cpu ready queue.

I just don't get how the CPU (or the main thread) knows what the RootNode is for a graph that is not on it. Or does the CPU build the graph, record its RootNode, and then move the graph to a GPU? Or something like that?

The last question is about the order of enqueuing nodes when a thread executes a computational graph. Take the graph in Figure 5 as an example, and assume there is no data exchange between threads (e.g., between two GPUs, or between a CPU and a GPU) while executing the graph. After thread_main executes the graph root, is the order of enqueuing its child nodes, LogBackward and SinBackward, guaranteed to be as shown in the animation, i.e., is LogBackward always enqueued before SinBackward? Or does it just depend on the actual implementation? And what happens if there is data exchange between threads? Just curious.

Regards,

A thread can exchange data with other threads between the computation of nodes, e.g. if I did some compute in forward on one GPU, moved the result onto another GPU, and then did some more compute. After the second GPU's thread finishes computing the backward of that second compute, it enqueues the gradients onto the first device/thread's queue.
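Roughly something like this (a minimal sketch; it assumes two CUDA devices are available, and the node-to-thread assignment in the comments just illustrates the idea above):

```python
import torch

# Forward spans two devices: compute on cuda:0, move to cuda:1, compute more.
x = torch.randn(3, device="cuda:0", requires_grad=True)
y = (x * 2).to("cuda:1")   # forward moves the tensor from cuda:0 to cuda:1
z = (y ** 2).sum()

# During backward, the cuda:1 worker handles the nodes whose inputs live on
# cuda:1, then enqueues the gradient task for the cuda:0 part of the graph
# onto the cuda:0 worker's ready queue, which finishes the x * 2 node and
# accumulates into x.grad.
z.backward()
print(x.grad)              # lives on cuda:0
```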

Or does the CPU build the graph, record its RootNode, and then move the graph to a GPU? Or something like that?

Yes, the graph is built as forward is computed. The root node is simply the node passed to backward.
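A quick way to see this, using the sin/log graph from the blog's Figure 5 as a sketch:

```python
import torch

x = torch.tensor(1.0, requires_grad=True)
out = x.sin().log()

# The graph was built while forward ran. out.grad_fn is its root node,
# and out.backward() hands exactly this node to the engine.
print(out.grad_fn)   # e.g. <LogBackward0 object at 0x...>
out.backward()
print(x.grad)        # cos(1.0) / sin(1.0)
```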

The last question is about the order of enqueuing nodes when a thread executes a computational graph.

If there is a single thread, the order is based on (1) the reverse of the original forward, plus (2) accumulate-grad nodes having priority.
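One way to observe the reverse-of-forward part is with tensor hooks (a minimal sketch on the same sin/log graph; the print order simply reflects when each gradient becomes available):

```python
import torch

a = torch.tensor(1.0, requires_grad=True)
b = a.sin()
c = b.log()

# Hooks fire as each gradient is produced, so the print order mirrors
# the backward execution order: the reverse of forward.
c.register_hook(lambda g: print("grad of c ready (graph root)"))
b.register_hook(lambda g: print("grad of b ready (after LogBackward)"))
a.register_hook(lambda g: print("grad of a ready (after SinBackward)"))
c.backward()
# Prints: c, then b, then a
```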

Hi soulitzer,

Thanks for your explanation.

So in other words, the order of enqueuing nodes is deterministic when there is one thread, but nondeterministic when there are multiple threads exchanging data, like the case you mentioned?

Also, are there in-depth materials about the two points, reverse of forward and grad priority, that I can look into?

Regards,

So in other words, the order of enqueuing nodes is deterministic when there is one thread, but nondeterministic when there are multiple threads exchanging data, like the case you mentioned?

Yes.

Also, are there in-depth materials about the two points, reverse of forward and grad priority, that I can look into?

Not sure, but here is some more info: sequence_nr is incremented on a per-thread basis, and new threads begin from 0. You can query b.grad_fn._sequence_nr() to observe this value. Grad accumulation is prioritized by setting its sequence_nr to int max.
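For instance, on the same sin/log graph (just a sketch to show where _sequence_nr() lives):

```python
import torch

a = torch.tensor(1.0, requires_grad=True)
b = a.sin()
c = b.log()

# Nodes created later in forward get larger sequence numbers, and the
# engine prefers larger ones, so backward runs roughly in reverse order
# of forward. AccumulateGrad nodes are given the maximum value so they
# run as soon as their gradient is ready.
print(b.grad_fn._sequence_nr())   # SinBackward, created first
print(c.grad_fn._sequence_nr())   # LogBackward, created later, larger number
```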

Got it, thanks for your time and patience! I'll dig into what you just mentioned.