Scheduling Forward and Backward in separate GPU cores

But you still need to do the forward passes. If you want to use multiple devices, you can just schedule fw/bw pairs on the different devices; the faster devices will simply end up doing more work.

About this, do you mean:

layer1.to('cuda:0')

and

layer2.to('cuda:1')

I already tried this one, it works fine actually.
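For reference, a minimal sketch of that kind of layer-per-device split, end to end; the layer sizes, device names, and dummy input below are placeholders, not taken from the original posts:

```python
import torch
import torch.nn as nn

# Hypothetical two-layer model split across two GPUs (model parallelism).
layer1 = nn.Linear(1024, 1024).to('cuda:0')
layer2 = nn.Linear(1024, 10).to('cuda:1')

x = torch.randn(32, 1024, device='cuda:0')
h = layer1(x)                  # computed on cuda:0
out = layer2(h.to('cuda:1'))   # activations moved to cuda:1 for the second layer
loss = out.sum()
loss.backward()                # backward runs on each layer's own device
```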

But more performance could be gained from an approach where more of the backward work is done by more devices. IMHO, I am not quite sure how hard or how practical this task is with the existing API.

how hard or how practical this task is with the existing API.

I would say very very hard. :confused: The autograd engine is really not built to do that.

About this, do you mean:
layer1.to('cuda:0')
and
layer2.to('cuda:1')

Not really. If you just want to split work across GPUs, we have a built-in module to do that: DataParallel.
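For completeness, typical DataParallel usage looks roughly like this; the toy model, shapes, and device ids are placeholder assumptions:

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 10).to('cuda:0')           # toy model, placeholder
model = nn.DataParallel(model, device_ids=[0, 1])  # replicas on both GPUs

x = torch.randn(64, 1024, device='cuda:0')
out = model(x)            # the batch is split across the listed devices
out.sum().backward()      # gradients are reduced back onto cuda:0
```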

If your devices are very different, it might not be optimal; what I had in mind was more something like:

  • Get your full batch
  • Split it into micro-batches
  • Add each micro-batch to a queue
  • Have one worker (thread) per device that you have
  • Each worker pulls a micro-batch from the queue, performs the fw/bw on its device, then accumulates the gradients on the original net (using a lock to make sure you get the right values there)

That way, faster devices will have their workers finish more work and all devices will be fully used.
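A rough, untuned sketch of that scheme, assuming a toy model and a placeholder micro-batch size; a real version would also refresh the per-device replicas after each optimizer step:

```python
import queue
import threading
import torch
import torch.nn as nn

net = nn.Linear(1024, 10)              # "original net", kept on CPU here
devices = ['cuda:0', 'cuda:1']
lock = threading.Lock()
work = queue.Queue()

# Get the full batch and split it into micro-batches on a queue.
full_x = torch.randn(256, 1024)
full_y = torch.randint(0, 10, (256,))
for xb, yb in zip(full_x.split(32), full_y.split(32)):
    work.put((xb, yb))

def worker(device):
    # Per-device replica used for the fw/bw passes.
    replica = nn.Linear(1024, 10).to(device)
    replica.load_state_dict(net.state_dict())
    criterion = nn.CrossEntropyLoss()
    while True:
        try:
            xb, yb = work.get_nowait()
        except queue.Empty:
            return
        replica.zero_grad()
        loss = criterion(replica(xb.to(device)), yb.to(device))
        loss.backward()
        # Accumulate this micro-batch's gradients on the original net.
        with lock:
            for p, rp in zip(net.parameters(), replica.parameters()):
                if p.grad is None:
                    p.grad = torch.zeros_like(p)
                p.grad += rp.grad.to(p.device)

threads = [threading.Thread(target=worker, args=(d,)) for d in devices]
for t in threads:
    t.start()
for t in threads:
    t.join()
# net's parameters now hold the summed gradients; an optimizer step would go here.
```

Faster devices simply drain more micro-batches from the shared queue, so the load balances itself without any explicit scheduling.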

I understand your point. I have tried DataParallel as well; the overheads show that more could be done.
But I understand the API constraint. Thanks a lot for clarifying this. :+1: :+1: :+1: