We are using PyTorch in an application where the model forward() is being bottlenecked by CPU speed as well as GPU speed. As a solution, we considered using DataParallel to parallelize batch processing. Although we only have 2 GPUs, we hope to use 8 or even 16 threads to cut down the CPU cost (this should be fine since the GPU usage is not at 100% during forward()).
We have the following line
model = nn.DataParallel(model, device_ids = [0, 0, 1, 1])
which gives the error
File "/home/kezhang/top_ml/top_ml/engine.py", line 277, in train
label_outputs=self.model(constituents, transitions, seq_lengths)
File "/home/kezhang/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/kezhang/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 122, in forward
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
File "/home/kezhang/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 127, in replicate
return replicate(module, device_ids)
File "/home/kezhang/.local/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 12, in replicate
param_copies = Broadcast.apply(devices, *params)
File "/home/kezhang/.local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 19, in forward
outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
File "/home/kezhang/.local/lib/python3.6/site-packages/torch/cuda/comm.py", line 40, in broadcast_coalesced
return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: inputs must be on unique devices
suggesting that GPUs need to be unique for DataParallel to work. Is there any particular reason for this? Are there other methods to achieve what we want to do?
Using a GPU more than once does not make sense here. It would mean executing python code twice in two different threads and run twice as many ops on the gpu (which are half the size). And GPUs are really bad at running small ops.
If the GPU usage is already 100% all the time, then the GPU is fully used and there is nothing you can do to speed things up more from the code point of view. Of course reducing model size/architecture could.
Hi, I’m in fact interested in that exact behaviour. I want multiple threads to use the same GPU since the CPU is a huge bottleneck and I want to mitigate that by having the CPU portion of the model be parallelized at the cost of running more ops on the same GPU at the same time. Since the GPU is already running small ops even with a single thread, I want to at least see if the benefit from parallelizing the CPU can beat out the penalty from doing the concurrent ops on the GPU. Any thoughts?
You said above that GPU usage was already 100%. There is nothing you can do to go faster. Your CPU is already waiting on the GPU to finish computing stuff before continuing.
The thing is that by setting 2 threads to the same gpu what will happen is that the original work that this gpu was doing will be split in two. And then executed as two different worloads. The total amount of work done on the GPU will be EXACTLY the same. Just doing twice the number of operations and smaller ones.
So you will execute more cpu code and process the same amount of data on the GPU. It can’t speed up the computations.
I am trying to say that the GPU is not the problem here. A large fraction of time is being spent on CPU processing so I want to use the multithreading capability that the machine has for the CPU. I understand that the GPU is not going to run any faster, but that’s not the goal in the first place.
I am not in the first case; they are mainly numpy and list manipulation ops.
I don’t believe that I am in the second case either since running DataParallel with 2 GPUs has the CPU running twice as fast for each thread (I profiled this as well), since the threads are not really accessing global memory.
The think is that even when running multiple thread, only one of them can run python code at a given time. All the others have to wait. A quick intro about the GIL can be found here.
The numpy operations might benefit a bit from it if they are matrix multiplications and they don’t already use multiple cores. But replacing these ops with pytorch version will use all the cores without need for multithreading.
To go back to the original question, DataParallel does not support using multiple times the same GPU because it won’t give any advantage if you use pytorch ops.
If you use other ops that could benefit from multithreading, I guess you will need to use python’s builting threading library to paralellelize the part of you code that can be (keep in mind this is mostly IO and some library calls that release the GIL and are monocore.
Hi, I did some extensive research into the GIL and I think I am understanding what you are saying. Thank you for the new insights.
It seems that multiprocessing.Pool can bypass this requirement so I will consider using this to speed up my code - but am I correct in assuming that DataParallel does not use multiprocessing.Pool and is therefore still limited by the GIL?