Using DataParallel with the same GPU multiple times

skk · October 31, 2018, 6:00am

We are using PyTorch in an application where the model forward() is being bottlenecked by CPU speed as well as GPU speed. As a solution, we considered using DataParallel to parallelize batch processing. Although we only have 2 GPUs, we hope to use 8 or even 16 threads to cut down the CPU cost (this should be fine since the GPU usage is not at 100% during forward()).

We have the following line

model = nn.DataParallel(model, device_ids = [0, 0, 1, 1])

which gives the error

  File "/home/kezhang/top_ml/top_ml/engine.py", line 277, in train
    label_outputs=self.model(constituents, transitions, seq_lengths)
  File "/home/kezhang/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kezhang/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 122, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/home/kezhang/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 127, in replicate
    return replicate(module, device_ids)
  File "/home/kezhang/.local/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 12, in replicate
    param_copies = Broadcast.apply(devices, *params)
  File "/home/kezhang/.local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 19, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/home/kezhang/.local/lib/python3.6/site-packages/torch/cuda/comm.py", line 40, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: inputs must be on unique devices

suggesting that GPUs need to be unique for DataParallel to work. Is there any particular reason for this? Are there other methods to achieve what we want to do?

albanD · October 31, 2018, 9:48am

Hi,

Using a GPU more than once does not make sense here. It would mean executing the Python code twice in two different threads and running twice as many ops on the GPU, each half the size. And GPUs are really bad at running small ops.

If the GPU usage is already 100% all the time, then the GPU is fully used and there is nothing you can do in the code to speed things up further. Of course, shrinking the model or changing its architecture could.

skk · October 31, 2018, 2:52pm

Hi, I’m in fact interested in that exact behaviour. I want multiple threads to use the same GPU since the CPU is a huge bottleneck and I want to mitigate that by having the CPU portion of the model be parallelized at the cost of running more ops on the same GPU at the same time. Since the GPU is already running small ops even with a single thread, I want to at least see if the benefit from parallelizing the CPU can beat out the penalty from doing the concurrent ops on the GPU. Any thoughts?

albanD · October 31, 2018, 2:56pm

You said above that GPU usage was already 100%. There is nothing you can do to go faster. Your CPU is already waiting on the GPU to finish computing stuff before continuing.

skk · October 31, 2018, 3:29pm

I meant to say that the GPU is not at 100%. Sorry if I mistyped.

albanD · October 31, 2018, 3:37pm

The thing is that if you assign 2 threads to the same GPU, the work that this GPU was originally doing will be split in two and executed as two different workloads. The total amount of work done on the GPU will be EXACTLY the same; it will just be twice as many operations, each of them smaller.
So you will execute more CPU code and process the same amount of data on the GPU. It can’t speed up the computations.

skk · October 31, 2018, 3:53pm

I am trying to say that the GPU is not the problem here. A large fraction of time is being spent on CPU processing so I want to use the multithreading capability that the machine has for the CPU. I understand that the GPU is not going to run any faster, but that’s not the goal in the first place.

albanD · October 31, 2018, 4:17pm

Is part of the model you put in the DataParallel actually running computation on the CPU?

skk · October 31, 2018, 5:17pm

Yes, there is significant CPU work in the model forward() as evidenced by profiling, so the DataParallel threads are doing CPU work.

albanD · October 31, 2018, 6:06pm

But the thing is that:

1. If they mainly use PyTorch ops, they should already use all the available cores, and thus more threads won’t help.
2. If they do Python stuff, they will be blocked by the GIL when multithreaded and so won’t run more Python code either.

You are not in these cases?

skk · October 31, 2018, 7:11pm

I am not in the first case; they are mainly numpy and list manipulation ops.

I don’t believe that I am in the second case either since running DataParallel with 2 GPUs has the CPU running twice as fast for each thread (I profiled this as well), since the threads are not really accessing global memory.

albanD · November 1, 2018, 10:42am

The thing is that even when running multiple threads, only one of them can run Python code at a given time. All the others have to wait. A quick intro about the GIL can be found here.
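A minimal sketch of this effect, assuming a standard CPython build with the GIL enabled. The names `busy_sum`, `run_sequential` and `run_threaded` are made up for illustration, and the loop only stands in for list-manipulation code. Each thread gets its own share of pure-Python work, yet the threaded wall time typically comes out about the same as the sequential one.

```python
import threading
import time


def busy_sum(n):
    """Pure-Python CPU work (hypothetical stand-in for list manipulation)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_sequential(sizes):
    """Run every task one after another in the calling thread."""
    return [busy_sum(n) for n in sizes]


def run_threaded(sizes):
    """Run every task in its own thread; the GIL still serializes the bytecode."""
    results = [None] * len(sizes)

    def worker(index, n):
        results[index] = busy_sum(n)

    threads = [
        threading.Thread(target=worker, args=(i, n)) for i, n in enumerate(sizes)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    sizes = [1_000_000] * 4

    start = time.perf_counter()
    sequential = run_sequential(sizes)
    print(f"sequential: {time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    threaded = run_threaded(sizes)
    print(f"4 threads:  {time.perf_counter() - start:.2f}s")

    # Same answers either way; only the wall-clock time is of interest.
    assert sequential == threaded
```

The exact timings depend on the machine and Python version, so none are quoted here.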

The numpy operations might benefit a bit from it if they are matrix multiplications and they don’t already use multiple cores. But replacing these ops with their PyTorch versions will use all the cores without the need for multithreading.

To go back to the original question, DataParallel does not support using the same GPU multiple times because it won’t give any advantage if you use PyTorch ops.
If you use other ops that could benefit from multithreading, I guess you will need to use Python’s built-in threading library to parallelize the part of your code that can be (keep in mind this is mostly IO and some library calls that release the GIL and are single-core).
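A rough sketch of the kind of work where threads do help: calls that block and release the GIL while waiting. Here `time.sleep` stands in for real IO, and `fake_io` and `fetch_all` are hypothetical names, not part of any library.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(item, delay=0.2):
    """Hypothetical blocking call; time.sleep releases the GIL while it waits."""
    time.sleep(delay)
    return item * 2


def fetch_all(items, max_workers=4, delay=0.2):
    """Overlap the blocking calls across a small thread pool, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda item: fake_io(item, delay), items))


if __name__ == "__main__":
    start = time.perf_counter()
    results = fetch_all([1, 2, 3, 4])
    # The four waits overlap, so the elapsed time should be close to one delay.
    print(results, f"{time.perf_counter() - start:.2f}s")
```

Swapping `fake_io` for a pure-Python loop would remove the benefit, since the threads would then compete for the GIL.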

skk · November 8, 2018, 7:39pm

Hi, I did some extensive research into the GIL and I think I am understanding what you are saying. Thank you for the new insights.
It seems that multiprocessing.Pool can bypass this requirement so I will consider using this to speed up my code - but am I correct in assuming that DataParallel does not use multiprocessing.Pool and is therefore still limited by the GIL?

albanD · November 12, 2018, 11:08am

Yes, DataParallel uses threads and so is blocked by the GIL for CPU-intensive tasks.
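For comparison, a minimal sketch of the multiprocessing.Pool approach mentioned above. Each worker is a separate process with its own interpreter and its own GIL, so pure-Python CPU work can run on several cores at once. The functions `count_primes` and `count_primes_parallel` are illustrative only and are not taken from the original application.

```python
import multiprocessing as mp


def count_primes(limit):
    """Pure-Python CPU work: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def count_primes_parallel(limits, processes=4):
    """Run one task per item in separate worker processes."""
    with mp.Pool(processes=processes) as pool:
        return pool.map(count_primes, limits)


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block on import.
    limits = [50_000] * 8
    print(count_primes_parallel(limits))
```

Arguments and return values are pickled between processes, so this pattern probably fits the CPU-side preprocessing (the numpy and list manipulation discussed earlier) better than the GPU calls themselves.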