Using DataParallel with the same GPU multiple times

We are using PyTorch in an application where the model forward() is being bottlenecked by CPU speed as well as GPU speed. As a solution, we considered using DataParallel to parallelize batch processing. Although we only have 2 GPUs, we hope to use 8 or even 16 threads to cut down the CPU cost (this should be fine since the GPU usage is not at 100% during forward()).

We have the following line

model = nn.DataParallel(model, device_ids=[0, 0, 1, 1])

which gives the error

  File "/home/kezhang/top_ml/top_ml/", line 277, in train
    label_outputs=self.model(constituents, transitions, seq_lengths)
  File "/home/kezhang/.local/lib/python3.6/site-packages/torch/nn/modules/", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kezhang/.local/lib/python3.6/site-packages/torch/nn/parallel/", line 122, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/home/kezhang/.local/lib/python3.6/site-packages/torch/nn/parallel/", line 127, in replicate
    return replicate(module, device_ids)
  File "/home/kezhang/.local/lib/python3.6/site-packages/torch/nn/parallel/", line 12, in replicate
    param_copies = Broadcast.apply(devices, *params)
  File "/home/kezhang/.local/lib/python3.6/site-packages/torch/nn/parallel/", line 19, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/home/kezhang/.local/lib/python3.6/site-packages/torch/cuda/", line 40, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: inputs must be on unique devices

suggesting that GPUs need to be unique for DataParallel to work. Is there any particular reason for this? Are there other methods to achieve what we want to do?


Using a GPU more than once does not make sense here. It would mean executing the Python code twice in two different threads and running twice as many ops on the GPU (each half the size). And GPUs are really bad at running small ops.

If the GPU usage is already 100% all the time, then the GPU is fully used and there is nothing you can do to speed things up further from the code's point of view. Of course, reducing the model size or changing the architecture could.

Hi, I’m in fact interested in that exact behaviour. I want multiple threads to use the same GPU since the CPU is a huge bottleneck and I want to mitigate that by having the CPU portion of the model be parallelized at the cost of running more ops on the same GPU at the same time. Since the GPU is already running small ops even with a single thread, I want to at least see if the benefit from parallelizing the CPU can beat out the penalty from doing the concurrent ops on the GPU. Any thoughts?

You said above that GPU usage was already 100%. There is nothing you can do to go faster. Your CPU is already waiting on the GPU to finish computing stuff before continuing.

I meant to say that the GPU is not at 100%. Sorry if I mistyped.

The thing is that by assigning 2 threads to the same GPU, the work that this GPU was originally doing will be split in two and then executed as two separate workloads. The total amount of work done on the GPU will be EXACTLY the same, just spread over twice as many, smaller operations.
So you will execute more CPU code while processing the same amount of data on the GPU. It can’t speed up the computations.

I am trying to say that the GPU is not the problem here. A large fraction of time is being spent on CPU processing so I want to use the multithreading capability that the machine has for the CPU. I understand that the GPU is not going to run any faster, but that’s not the goal in the first place.

Is part of the model you wrap in DataParallel actually running computation on the CPU?

Yes, there is significant CPU work in the model forward() as evidenced by profiling, so the DataParallel threads are doing CPU work.

But the thing is that:

  • If they mainly use PyTorch ops, they should already use all available cores, so more threads won’t help.
  • If they do Python stuff, they will be blocked by the GIL when multithreaded, so they won’t run more Python code either.

You are not in these cases?
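The second point is easy to check for yourself. In CPython, a pure-Python CPU-bound function holds the GIL while it runs, so splitting it across threads does not reduce the total wall time. A minimal, self-contained sketch (the function and sizes here are illustrative, not from the model in question):

```python
import threading
import time

def busy(n):
    # Pure-Python CPU-bound loop; holds the GIL the whole time.
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 2_000_000

# Run the work twice serially.
start = time.perf_counter()
busy(N)
busy(N)
serial = time.perf_counter() - start

# Run the same work in two threads.
start = time.perf_counter()
threads = [threading.Thread(target=busy, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

# Under the GIL, the threaded version is not meaningfully faster,
# since only one thread executes Python bytecode at a time.
print(f"serial:   {serial:.3f}s")
print(f"threaded: {threaded:.3f}s")
```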

I am not in the first case; they are mainly numpy and list manipulation ops.

I don’t believe that I am in the second case either since running DataParallel with 2 GPUs has the CPU running twice as fast for each thread (I profiled this as well), since the threads are not really accessing global memory.

The thing is that even when running multiple threads, only one of them can run Python code at a given time. All the others have to wait. A quick intro to the GIL can be found here.

The numpy operations might benefit a bit from it if they are matrix multiplications and don’t already use multiple cores. But replacing these ops with their PyTorch versions will use all the cores without needing multithreading.
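To illustrate that last point with a generic sketch (not the poster's actual code): PyTorch ops release the GIL inside their C++ kernels and use an internal intra-op thread pool, so a single Python thread can already keep several cores busy, and `torch.from_numpy` lets you move an existing numpy workload over without copying:

```python
import numpy as np
import torch

# PyTorch's intra-op thread pool size; ops like matmul are
# parallelized across this many threads from a single Python thread.
print(torch.get_num_threads())

a = np.random.rand(512, 512).astype(np.float32)
b = np.random.rand(512, 512).astype(np.float32)

# numpy version (multithreaded or not depending on the BLAS build)
c_np = a @ b

# Same op through PyTorch; from_numpy shares memory, no copy is made.
c_t = torch.from_numpy(a) @ torch.from_numpy(b)

# Both compute the same product, up to float32 rounding.
assert np.allclose(c_np, c_t.numpy(), atol=1e-3)
```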

To go back to the original question, DataParallel does not support using the same GPU multiple times because it won’t give any advantage if you use PyTorch ops.
If you use other ops that could benefit from multithreading, I guess you will need to use Python’s built-in threading library to parallelize the parts of your code that can be (keep in mind this is mostly IO and some library calls that release the GIL and are single-core).

Hi, I did some extensive research into the GIL and I think I am understanding what you are saying. Thank you for the new insights.
It seems that multiprocessing.Pool can bypass this limitation, so I will consider using it to speed up my code. But am I correct in assuming that DataParallel does not use multiprocessing.Pool and is therefore still limited by the GIL?

Yes, DataParallel uses threads and so is blocked by the GIL for CPU-intensive tasks.