Using 2 GPUs for Different Parts of the Model

Hi!

I have two identical GPUs and I want to achieve faster processing by utilizing both of them. However, I am doing this in a different way from the usual data parallelism, imitating the idea of Massively Parallel Video Networks:

I have divided my model into two sub-models and I want to run them concurrently: one part processes the input video frame by frame, and the other processes the output of the first. There is a catch, though: as soon as the first sub-model returns an output, it passes it to the second sub-model and immediately starts processing the next frame of the input. By utilizing both GPUs this way, the authors of the paper achieve faster processing. Any idea on how to do this? The figure shows the idea (the network is unrolled over time):

[figure: the pipelined network unrolled over time]
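
To make it concrete, the behavior I am after looks roughly like this sketch with plain Python threads and a queue (`frames`, `sub_model_1`, and `sub_model_2` are placeholders for my own data and modules; this is the idea, not code I have working):

```python
import queue
import threading

import torch

q = queue.Queue()
results = []

def stage_1(frames, sub_model_1):
    # cuda:0 -- as soon as a frame is done, hand the output off and
    # immediately start on the next frame.
    with torch.no_grad():
        for frame in frames:
            q.put(sub_model_1(frame.to("cuda:0")))
    q.put(None)  # sentinel: no more frames

def stage_2(sub_model_2):
    # cuda:1 -- consume outputs of stage 1 as they arrive.
    with torch.no_grad():
        while True:
            x = q.get()
            if x is None:
                break
            results.append(sub_model_2(x.to("cuda:1")))

t1 = threading.Thread(target=stage_1, args=(frames, sub_model_1))
t2 = threading.Thread(target=stage_2, args=(sub_model_2,))
t1.start(); t2.start()
t1.join(); t2.join()
```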

The idea is not the same as nn.DataParallel(). I have tried torch.multiprocessing and DistributedDataParallel(), but I am having trouble understanding how to set this up.

If anyone has an answer, I would be glad to hear it.

Thanks.

One approach…

Start two Python programs in separate interpreters to avoid the dreaded GIL. (A sketch of both processes follows the steps below.)

Processor 1

  1. Put the tensor on cuda:0, get the output.
  2. Serialize the output and push it to a shared Redis database.

Processor 2

  1. The consumer picks it up from the database and pushes it to cuda:1.
  2. The consumer runs the next step of the calculation.

If you need to send gradients for backprop, you can store and reload them the same way.
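
Roughly, the two programs could look like this (just a sketch, assuming redis-py and a Redis server on localhost; `make_sub_model_1`, `make_sub_model_2`, `frames`, and the `"activations"` key are placeholder names):

```python
# producer.py -- process 1, owns cuda:0
import io

import redis
import torch

r = redis.Redis(host="localhost", port=6379)
model = make_sub_model_1().to("cuda:0").eval()  # placeholder constructor

with torch.no_grad():
    for frame in frames:  # your frame source
        out = model(frame.to("cuda:0"))
        buf = io.BytesIO()
        torch.save(out.cpu(), buf)              # serialize the intermediate tensor
        r.rpush("activations", buf.getvalue())  # enqueue for process 2
```

```python
# consumer.py -- process 2, owns cuda:1
import io

import redis
import torch

r = redis.Redis(host="localhost", port=6379)
model = make_sub_model_2().to("cuda:1").eval()  # placeholder constructor

with torch.no_grad():
    while True:
        _, payload = r.blpop("activations")  # blocks until the producer pushes
        x = torch.load(io.BytesIO(payload)).to("cuda:1")
        result = model(x)  # next step of the calculation
```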

That’s one way… not easy, though. I easily spent a month just trying to distribute calculations over multiple processors.

If you can pull it off… then it’s an awesome skill.

Also, there is the Ray project (https://github.com/ray-project/ray), a unified framework for scaling AI and Python applications.

I tried using it. It had great promise, but ended up being a bit too new at the time. It might be a bit more mature now.
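
For what it’s worth, Ray’s actor model maps quite naturally onto this kind of pipeline. A rough sketch, assuming a machine with 2 GPUs (`Stage`, `make_sub_model_1`, `make_sub_model_2`, and `frames` are placeholder names, not Ray API):

```python
import ray
import torch

ray.init()

@ray.remote(num_gpus=1)
class Stage:
    # Ray pins one GPU to each actor, so inside the actor it is just "cuda".
    def __init__(self, make_model):
        self.model = make_model().to("cuda").eval()

    def forward(self, x):
        with torch.no_grad():
            return self.model(x.to("cuda")).cpu()

stage1 = Stage.remote(make_sub_model_1)  # placeholder constructors
stage2 = Stage.remote(make_sub_model_2)

# Passing the ObjectRef from stage 1 straight into stage 2 lets Ray move the
# data between actors, and successive frames pipeline automatically because
# the remote calls are asynchronous.
refs = [stage2.forward.remote(stage1.forward.remote(f)) for f in frames]
results = ray.get(refs)
```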

Thanks for your reply, and sorry for my late response. I will look into these methods. I am only doing this for the test phase, so I will only have to transfer one tensor per input frame to processor 2.

If anybody else has some further suggestions, I will be happy to hear them!

Thanks.

Will this tutorial be helpful? https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
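
The pipelining section there relies on CUDA calls being asynchronous: while cuda:1 is still busy with frame t, the work for frame t+1 can already be queued on cuda:0. Adapted from batches to your frame-by-frame setting, the idea is roughly this (`stage1`/`stage2` stand in for your two sub-models):

```python
import torch

def pipelined_inference(frames, stage1, stage2):
    # stage1 lives on cuda:0, stage2 lives on cuda:1.
    outputs = []
    with torch.no_grad():
        prev = stage1(frames[0].to("cuda:0")).to("cuda:1")
        for frame in frames[1:]:
            cur = stage1(frame.to("cuda:0"))  # queued on cuda:0 ...
            outputs.append(stage2(prev))      # ... while cuda:1 processes prev
            prev = cur.to("cuda:1")
        outputs.append(stage2(prev))
    return outputs
```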

Thank you. I have seen this tutorial before; however, the model parallel part is not what I want. For the pipelining part, I am having trouble understanding how it is actually executed. If you can further clarify that part for me, that would be great.

Did you find a solution?
I have the same problem and need to separate my model into parts and load them onto 2 GPUs, but the second part is a frozen LLM and doesn’t require grad.
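
For context, what I am trying to do is roughly this (`part1` and `frozen_llm` are placeholders for my modules):

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self, part1, frozen_llm):
        super().__init__()
        self.part1 = part1.to("cuda:0")            # trainable first part
        self.llm = frozen_llm.to("cuda:1").eval()  # frozen second part
        for p in self.llm.parameters():
            p.requires_grad_(False)

    def forward(self, x):
        h = self.part1(x.to("cuda:0"))
        return self.llm(h.to("cuda:1"))  # gradients still flow back to part1
```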