Data Parallelism on a single GPU card


I want to train a language model with various batch sizes, especially batch size = 1. As expected, one epoch takes much longer to train, yet GPU utilization stays very low. Can I speed this up with data parallelism, using more workers (?) on the same GPU card? (Maybe something like a GPU version of “Hogwild” with data parallelism?)


I think the reduce parameter of the loss functions might be helpful.

reduce (bool, optional) – By default, the losses are averaged or summed for each minibatch. When reduce is False, the loss function returns a loss per batch element instead and ignores size_average. Default: True

Using this, you could increase the batch size and still get a separate loss for each element in the batch.
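
Here is a minimal sketch of what that could look like. It assumes a recent PyTorch version, where reduce=False is spelled reduction='none' (reduce and size_average are deprecated there). The toy logits, targets, and sizes are made up for illustration and stand in for your model's output. Note that this only gives you per-sample losses from one forward pass; it still does a single backward pass for the whole batch, so it is not the same as doing one parameter update per sample with batch size 1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy setup: 4 samples, 10 classes (e.g. a tiny vocabulary).
batch_size, vocab_size = 4, 10
logits = torch.randn(batch_size, vocab_size, requires_grad=True)  # stand-in for model output
targets = torch.randint(0, vocab_size, (batch_size,))

# reduction='none' returns one loss value per batch element
# (the equivalent of reduce=False in older PyTorch versions).
per_sample_loss_fn = nn.CrossEntropyLoss(reduction='none')
per_sample_loss = per_sample_loss_fn(logits, targets)  # shape: (batch_size,)

# The default reduction averages over the batch, for comparison.
mean_loss = nn.CrossEntropyLoss()(logits, targets)

# Inspect the individual losses, then reduce them yourself before backward().
print(per_sample_loss)
per_sample_loss.sum().backward()  # one backward pass for the whole batch
```

Averaging per_sample_loss gives the same value as the default loss, so you can log the individual losses and still train on the mean or the sum.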