Parallel implementation of Gaussian mixture models on multiple GPUs

I’m currently working on language recognition and want to implement a Gaussian mixture model (GMM) that runs in parallel on multiple GPUs (I have 2 GPUs with 11 GB each). So far I have a naive GMM written as plain Python functions, which does everything on the CPU without any parallelization or optimization (for example, I don’t need to store gradients for the tensors). I’d like some ideas or suggestions on how to go about parallelizing the GMM.

Would you like to apply some data parallelism (cloning the model on each GPU and splitting the data) or would you like to shard the model somehow (different GPUs compute different parts of your model)?

I was thinking of the first way, similar to the nn.DataParallel approach. That seems feasible to me.
I’m not sure how I would go about implementing it the second way, though.
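
For the data-parallel route, one possible shape is sketched below. It assumes a diagonal-covariance GMM and a full-batch EM update, neither of which is stated in the thread, and the helper names (shard_stats, em_step) are made up for illustration. Each device gets a copy of the parameters and one slice of the data, computes its sufficient statistics, and the statistics are summed on the first device for the M-step. Gradients are switched off with torch.no_grad() since EM does not need them.

```python
import math
import torch


def shard_stats(x, weights, means, variances):
    """E-step on one data shard (diagonal covariances assumed).

    x: (n, d); weights: (k,); means, variances: (k, d).
    Returns the sufficient statistics and the shard log-likelihood.
    """
    diff = x[:, None, :] - means[None, :, :]                    # (n, k, d)
    log_prob = -0.5 * ((diff ** 2 / variances).sum(-1)
                       + torch.log(variances).sum(-1)
                       + x.shape[1] * math.log(2 * math.pi))    # (n, k)
    log_joint = log_prob + torch.log(weights)
    log_norm = torch.logsumexp(log_joint, dim=1, keepdim=True)  # (n, 1)
    resp = torch.exp(log_joint - log_norm)                      # responsibilities
    return resp.sum(0), resp.T @ x, resp.T @ (x ** 2), log_norm.sum()


@torch.no_grad()  # EM needs no autograd graph
def em_step(x, weights, means, variances, devices):
    """One EM iteration with the data split across `devices`."""
    shards = torch.chunk(x, len(devices), dim=0)
    # Launch the E-step on every device before gathering anything.
    per_device = [
        shard_stats(s.to(d), weights.to(d), means.to(d), variances.to(d))
        for s, d in zip(shards, devices)
    ]
    # Reduce: sufficient statistics are additive across shards.
    totals = [sum(t.to(devices[0]) for t in group) for group in zip(*per_device)]
    nk, sx, sxx, loglik = totals
    # M-step on the first device.
    new_weights = nk / nk.sum()
    new_means = sx / nk[:, None]
    new_vars = (sxx / nk[:, None] - new_means ** 2).clamp_min(1e-6)
    return new_weights, new_means, new_vars, loglik.item()


# Use every visible GPU, or fall back to the CPU.
devices = [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
devices = devices or [torch.device("cpu")]
```

This sketch copies each shard to its device on every call, which is wasteful; keeping the shards resident on their GPUs across iterations would probably be the first thing to change. The second way (sharding the mixture components across GPUs) would additionally need the per-component log-probabilities combined across devices before the logsumexp, so it involves more communication per step.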