I’m running a BiLSTM + CRF model. If I choose a vocab_size of 10,000 and batch_size = 512, the forward speed is equal to using DataParallel with 4 GPUs and batch_size = 2048. However, if I choose a vocab_size of 0.5 million, the forward speed on a single GPU is 8 times faster than DataParallel with 4 GPUs and batch_size = 2048. The single-GPU speed is similar in both cases. I don’t know why. I have already set embedding.weight.requires_grad = False.
If your module has a lot of parameters, wrapping it in
nn.DataParallel might create a slowdown, since these parameters have to be scattered (replicated) to all GPUs on every forward pass.
Have a look at the ImageNet example, where only
model.features is wrapped in DataParallel. This is also explained in the “One weird trick” paper.
In your use case you could try to leave the embedding layer on a single device.
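A minimal sketch of that idea for this use case (the layer names and sizes here are assumptions, not the original model): the large frozen embedding lives on one GPU, and only the BiLSTM is wrapped in nn.DataParallel, so the 0.5M-row embedding table is never replicated on each forward pass.

```python
import torch
import torch.nn as nn

class Tagger(nn.Module):
    """Hypothetical BiLSTM tagger; CRF layer omitted for brevity."""

    def __init__(self, vocab_size=500_000, emb_dim=128, hidden=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Frozen, as in the question above.
        self.embedding.weight.requires_grad = False
        self.bilstm = nn.LSTM(emb_dim, hidden,
                              bidirectional=True, batch_first=True)

    def forward(self, tokens):
        emb = self.embedding(tokens)   # runs on the embedding's single device
        out, _ = self.bilstm(emb)      # only this part gets scattered by DataParallel
        return out

model = Tagger()
if torch.cuda.is_available():
    model.embedding.to("cuda:0")                   # embedding stays on one GPU
    model.bilstm = nn.DataParallel(model.bilstm)   # replicate only the small part
    model.bilstm.to("cuda:0")
```

Since `requires_grad` is False on the embedding weight, you can also pass only `model.bilstm.parameters()` to the optimizer.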
Thanks for your advice. If I leave the embedding layer on a single device, like cuda:0, and the other model parameters on cuda:1, can these parameters still be updated at the same time?
Yes, as you can see in the ImageNet example, you can just use
DataParallel on a certain part of your model, which will work fine.
In fact you can split your model in any way, e.g. leaving one part on the CPU, another on a single GPU, and yet another wrapped in DataParallel.
As long as you push the activations and inputs to the appropriate device, everything will work fine; otherwise, you’ll get device-mismatch errors.
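A small illustrative sketch of such a split (the module names here are made up): one part stays on the CPU, another is moved to a GPU when available, and the forward pass pushes the activation to the next part's device before calling it. Autograd handles gradients across devices, so one optimizer over all parameters updates both parts in the same step.

```python
import torch
import torch.nn as nn

class SplitModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(16, 32)  # stays on the CPU
        self.part2 = nn.Linear(32, 4)   # moved to a GPU below, if one exists

    def forward(self, x):
        h = self.part1(x)  # computed on part1's device
        # Push the activation to wherever part2's parameters live.
        h = h.to(next(self.part2.parameters()).device)
        return self.part2(h)

model = SplitModel()
if torch.cuda.is_available():
    model.part2.to("cuda:0")

# A single optimizer sees parameters on both devices.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
```

Without the `h.to(...)` line, the call to `self.part2` would raise a device-mismatch error as soon as the two parts live on different devices.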
Thanks, I solved it after following your advice :wink: