Scheduling Forward and Backward in separate GPU cores

But you still need to do the forward passes. If you want to use multiple devices, you can just schedule fw/bw pairs on the different devices; the faster devices will simply end up doing more work.

About this, do you mean:

layer1.to('cuda:0')

and

layer2.to('cuda:1')

I already tried this one, it works fine actually.
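For reference, a minimal sketch of that kind of layer-per-device split, end to end; the layer sizes, device names, and dummy input below are placeholders, not taken from the original posts:

```python
import torch
import torch.nn as nn

# Hypothetical two-layer model split across two GPUs (model parallelism).
layer1 = nn.Linear(1024, 1024).to('cuda:0')
layer2 = nn.Linear(1024, 10).to('cuda:1')

x = torch.randn(32, 1024, device='cuda:0')
h = layer1(x)                  # computed on cuda:0
out = layer2(h.to('cuda:1'))   # activations moved to cuda:1 for the second layer
loss = out.sum()
loss.backward()                # backward runs on each layer's own device
```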

But more performance could be gained from an approach where more of the backward work is done by more devices. IMHO, I am not quite sure how hard or how practical this task is with the existing API.

how hard or how practical this task is with the existing API.

I would say very very hard. :confused: The autograd engine is really not built to do that.

About this, do you mean:
layer1.to('cuda:0')
and
layer2.to('cuda:1')

Not really. If you just want to split work across GPUs, we have a built-in module to do that: DataParallel.
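For completeness, typical DataParallel usage looks roughly like this; the toy model, shapes, and device ids are placeholder assumptions:

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 10).to('cuda:0')           # toy model, placeholder
model = nn.DataParallel(model, device_ids=[0, 1])  # replicas on both GPUs

x = torch.randn(64, 1024, device='cuda:0')
out = model(x)            # the batch is split across the listed devices
out.sum().backward()      # gradients are reduced back onto cuda:0
```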

If your devices are very different, it might not be optimal; what I had in mind was more something like:

  • Get your full batch
  • Split it into micro-batches
  • Add each micro-batch to a queue
  • Have one worker (thread) per device that you have
  • Each worker pulls a micro-batch from the queue, performs the fw/bw on its device, then accumulates the gradients on the original net (using a lock to make sure you get the right values there)

That way, faster devices will have their workers finish more work and all devices will be fully used.
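A rough, untuned sketch of that scheme, assuming a toy model and a placeholder micro-batch size; a real version would also refresh the per-device replicas after each optimizer step:

```python
import queue
import threading
import torch
import torch.nn as nn

net = nn.Linear(1024, 10)              # "original net", kept on CPU here
devices = ['cuda:0', 'cuda:1']
lock = threading.Lock()
work = queue.Queue()

# Get the full batch and split it into micro-batches on a queue.
full_x = torch.randn(256, 1024)
full_y = torch.randint(0, 10, (256,))
for xb, yb in zip(full_x.split(32), full_y.split(32)):
    work.put((xb, yb))

def worker(device):
    # Per-device replica used for the fw/bw passes.
    replica = nn.Linear(1024, 10).to(device)
    replica.load_state_dict(net.state_dict())
    criterion = nn.CrossEntropyLoss()
    while True:
        try:
            xb, yb = work.get_nowait()
        except queue.Empty:
            return
        replica.zero_grad()
        loss = criterion(replica(xb.to(device)), yb.to(device))
        loss.backward()
        # Accumulate this micro-batch's gradients on the original net.
        with lock:
            for p, rp in zip(net.parameters(), replica.parameters()):
                if p.grad is None:
                    p.grad = torch.zeros_like(p)
                p.grad += rp.grad.to(p.device)

threads = [threading.Thread(target=worker, args=(d,)) for d in devices]
for t in threads:
    t.start()
for t in threads:
    t.join()
# net's parameters now hold the summed gradients; an optimizer step would go here.
```

Faster devices simply drain more micro-batches from the shared queue, so the load balances itself without any explicit scheduling.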

I understand your point. I have tried DataParallel as well; the overheads show that more could be done.
But I understand the API constraint. Thanks a lot for clarifying this. :+1: :+1: :+1: