Scheduling Forward and Backward on separate GPU devices

Hi,

Is there a specific scheduling capability where I can define the forward pass and the backward pass to run on specific GPU devices? With model parallelism, it is clear that we can schedule the layers to be on specific GPU devices. But could we go further into the details? Without overriding the backward function, is it possible to give a config and train a neural network this way? This could involve a weight copy or activation copy to a separate GPU device to do the backward. Is this something possible with the Autograd definitions and GPU scheduling in PyTorch?

Hi,

No, this is not possible at the moment, I'm afraid.
What the engine does is run the backward of a function on the same device (and CUDA stream, if it was on GPU) as the forward pass. This is a good heuristic, as the backward pass is usually as expensive as the forward, so the split used in the forward is a good indication of how the backward should be split.
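
For example, with the layers split across two devices (a minimal sketch, assuming two GPUs are available), the backward of each layer automatically runs where its forward ran:

import torch

# Model-parallel split: layer0 on cuda:0, layer1 on cuda:1.
layer0 = torch.nn.Linear(10, 10).to("cuda:0")
layer1 = torch.nn.Linear(10, 1).to("cuda:1")

x = torch.randn(32, 10, device="cuda:0")
h = layer0(x)                # forward on cuda:0
y = layer1(h.to("cuda:1"))   # forward on cuda:1
y.sum().backward()           # backward of layer1 runs on cuda:1, of layer0 on cuda:0

print(layer0.weight.grad.device, layer1.weight.grad.device)  # cuda:0 cuda:1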


@albanD

Exactly. But if we don't use backward() and do everything in a manual fashion, is this possible with multiple CUDA streams?

But if we don't use backward() and do everything in a manual fashion,

Not sure what you mean by this.
If you don't use the autograd engine and write everything by hand, you can do whatever you want, since you're writing it by hand.
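
For instance, nothing stops you from running the forward on one device and a hand-written backward on another (illustrative sketch for a single linear layer, assuming two GPUs; no autograd involved):

import torch

# Forward of y = x @ W on cuda:0, hand-written backward math on cuda:1.
W = torch.randn(10, 5, device="cuda:0")
x = torch.randn(32, 10, device="cuda:0")
y = x @ W                                   # forward on cuda:0

grad_y = torch.ones_like(y).to("cuda:1")    # pretend upstream gradient
x1, W1 = x.to("cuda:1"), W.to("cuda:1")     # copy what the backward needs
grad_W = x1.t() @ grad_y                    # dL/dW computed on cuda:1
grad_x = grad_y @ W1.t()                    # dL/dx computed on cuda:1

W -= 0.01 * grad_W.to("cuda:0")             # update back on cuda:0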

Yes that is true.

What I meant was using the autograd function on a given tensor and getting its gradient update value.
Generally what we do is: we have this leaf node or loss value and call backward().

So it traverses back and calculates all the gradients for us. Is this right?

If so, I can access w.grad and get the weight updates.

What if I manually go back, layer by layer, calling the corresponding grad_fn of each tensor (contributing to the weights), passing shape-matching inputs, and getting the grad value?

This way, I get more control over it. I don't want to entirely lose the autograd functionality, but use it at the node level just to get the corresponding weight vector update. Could this be possible?

I guess if you save every input/output pair for every function, you can do this. If you have b = foo(a) and c = bar(b):
grad_b = autograd.grad(c, b, grad_c)
grad_a = autograd.grad(b, a, grad_b)
And do that for every op if you have more. That will work, but it might not be super efficient.
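
Concretely, something like this runnable sketch (sin/exp stand in for foo/bar; any differentiable ops would do):

import torch
from torch import autograd

a = torch.randn(5, requires_grad=True)
b = a.sin()            # b = foo(a)
c = b.exp()            # c = bar(b)

grad_c = torch.ones_like(c)                                  # upstream gradient for c
grad_b = autograd.grad(c, b, grad_c, retain_graph=True)[0]
grad_a = autograd.grad(b, a, grad_b)[0]

# Sanity check against a normal end-to-end backward.
a2 = a.detach().requires_grad_()
a2.sin().exp().sum().backward()
print(torch.allclose(grad_a, a2.grad))  # True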


Your function representation is exactly what I explained.

One point I'd like to understand: why could this be very inefficient?

Inefficient in the sense of more time spent writing code, or will execution be slow because it doesn't use some of the internal optimizations in Autograd?

Inefficient at execution time, because the autograd engine has some overhead (to discover the work that needs to be done) that is going to be repeated many times here. Also, the backward pass usually runs entirely in C++ without holding the Python GIL; here, you will come back to Python after each operation.


Great! So if I write a clean C++ implementation, I wouldn't need to worry about the Python overhead, but I would still have to deal with the overhead of the autograd engine.

Could you please elaborate a bit on the overhead?

This overhead is mainly the discovery of what needs to be done to compute the gradients. So it needs to traverse the whole graph of computation, which takes a bit of time.

Note that if you’re simply experimenting, this overhead won’t kill you. But it won’t be 0.
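
A rough way to see the repeated-discovery and per-op Python round-trip cost (a CPU-only sketch; the exact numbers don't matter):

import time
import torch
from torch import autograd

x = torch.randn(1000, requires_grad=True)
outs = [x]
for _ in range(200):                 # a chain of 200 elementwise ops
    outs.append(outs[-1].sin())
loss = outs[-1].sum()

t0 = time.time()
g = autograd.grad(loss, x, retain_graph=True)[0]   # one engine invocation
t1 = time.time()

grad = torch.ones_like(outs[-1])                   # d(loss)/d(outs[-1])
for prev, cur in zip(reversed(outs[:-1]), reversed(outs[1:])):
    grad = autograd.grad(cur, prev, grad, retain_graph=True)[0]  # one call per op
t2 = time.time()

print(f"single call: {t1 - t0:.4f}s, per-op calls: {t2 - t1:.4f}s")
print(torch.allclose(g, grad))       # same result either way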


@albanD I have a follow-up question to this. So, AFAIU, the forward and backward passes happen inside the GPU device without holding the Python GIL.

But what about the weight update done by an optimizer? Would the optimizer access the grads and params of each layer in the Python environment, update the weights, and push them back to the GPU? Or would the Python logic be translated into a CUDA kernel so the weights are updated inside the GPU itself?

These are quite orthogonal things.
If you're using Python, the Tensor is a Python object. But if the Tensor is a GPU tensor, then the memory it works with is on the GPU. So when you perform operations on it, they will run on the GPU.

The discussion about the GIL is unrelated. You can also use the underlying C++ Tensor object without touching the Python object at all.
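
In other words, something like this (a minimal sketch) keeps the whole update on the GPU:

import torch

model = torch.nn.Linear(10, 10).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 10, device="cuda")
loss = model(x).sum()
loss.backward()

# The parameters and their .grad live on the GPU, so opt.step() launches
# CUDA kernels for the update; nothing is copied back to the CPU.
opt.step()
print(model.weight.device)  # cuda:0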

@albanD I think this clears things up. :+1:

Hi @albanD,
How much work would it be to separate out the forward and backward passes? :slight_smile:

Like run the forward on one GPU and the backward on another?

Yes, on separate GPUs.

That would be a LOT of work :smiley:

What would be the benefit of doing this?

@albanD

For the sake of argument, to reduce the computation overhead of the backward:

If we can divide the work into two parts, the forward and the backward, can't we get better performance?

Which APIs would be involved in such a task? How complex would it be?
Could you give some insight with respect to the existing APIs?

I don't see how you can get better performance.
The forward and backward are sequential.
And if you have 2 devices and want to overlap multiple forward/backward passes, you can overlap them while keeping each fw/bw pair on the same device.
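
For example (illustrative sketch, assuming two GPUs and a tiny replicated model), two fw/bw pairs can overlap, one per device:

import torch

models = [torch.nn.Linear(10, 1).to(f"cuda:{i}") for i in range(2)]
batches = [torch.randn(32, 10, device=f"cuda:{i}") for i in range(2)]

# CUDA work is queued asynchronously, so the two pairs can run concurrently,
# each forward and its backward staying on the same device.
losses = [m(x).pow(2).mean() for m, x in zip(models, batches)]
for loss in losses:
    loss.backward()
torch.cuda.synchronize()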

Which APIs would be involved in such a task? How complex would it be?
Could you give some insight with respect to the existing APIs?

The whole backward pass assumes that the intermediary gradients live on the same device as the intermediary results from the forward pass. Also, if some buffers were saved, they would need to be moved to a different device.
This will for sure not be accepted as a PR (unless you can show a use case where it brings a massive speed improvement). And hacking it in might still be quite hard.

What if we use a pipeline mechanism to load micro-batches instead of mini-batches, and do the final weight update at the end of the mini-batch?

Then the sequential relationship can still be maintained (it cannot be broken, because of the weight update requirement). But when the backward takes more time, more GPUs can work on the backward and reduce the backward time. Isn't this a practical use case where we could improve performance?
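
For reference, the micro-batch part of this idea boils down to gradient accumulation (a minimal sketch; GPipe-style pipeline parallelism additionally splits the model across devices and interleaves the micro-batches):

import torch

model = torch.nn.Linear(10, 1).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

mini_batch = torch.randn(64, 10, device="cuda")
for micro in mini_batch.chunk(4):           # 4 micro-batches
    loss = model(micro).pow(2).mean() / 4   # scale so the gradients average correctly
    loss.backward()                         # gradients accumulate in .grad
opt.step()                                  # one weight update per mini-batch
opt.zero_grad()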