We are using PyTorch in an application where the model's forward() is bottlenecked by CPU speed as well as GPU speed. As a solution, we considered using DataParallel to parallelize batch processing. Although we only have 2 GPUs, we would like to use 8 or even 16 replicas (threads) to cut down the CPU cost; this should be fine, since GPU utilization is not at 100% during forward().
We have the following line
model = nn.DataParallel(model, device_ids=[0, 0, 1, 1])
which gives the error
File "/home/kezhang/top_ml/top_ml/engine.py", line 277, in train label_outputs=self.model(constituents, transitions, seq_lengths) File "/home/kezhang/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__ result = self.forward(*input, **kwargs) File "/home/kezhang/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 122, in forward replicas = self.replicate(self.module, self.device_ids[:len(inputs)]) File "/home/kezhang/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 127, in replicate return replicate(module, device_ids) File "/home/kezhang/.local/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 12, in replicate param_copies = Broadcast.apply(devices, *params) File "/home/kezhang/.local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 19, in forward outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus) File "/home/kezhang/.local/lib/python3.6/site-packages/torch/cuda/comm.py", line 40, in broadcast_coalesced return torch._C._broadcast_coalesced(tensors, devices, buffer_size) RuntimeError: inputs must be on unique devices
suggesting that the device IDs passed to DataParallel must be unique. Is there a particular reason for this restriction? Are there other ways to achieve what we want?
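For reference, here is a minimal, self-contained script that reproduces the error on a 2-GPU machine. The model, batch size, and feature dimensions are placeholders, not the ones from our actual code:

import torch
import torch.nn as nn

# Placeholder model and input, assuming at least 2 visible GPUs.
model = nn.Linear(16, 4).cuda(0)
x = torch.randn(32, 16).cuda(0)

# Duplicated device IDs: the intent was two replicas per GPU.
model = nn.DataParallel(model, device_ids=[0, 0, 1, 1])

# The error is raised at forward() time, when DataParallel replicates
# the module across the listed devices:
# RuntimeError: inputs must be on unique devices
out = model(x)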