Data vs Model Parallelism

Hello.

I am doing some experiments with multi-GPU training in PyTorch. I am training an 18-layer VGG convolutional network on CIFAR. Using the DataParallel module with two GTX 1080 8 GB GPUs and the cuDNN library on PyTorch 0.2, one forward pass over the whole dataset takes 0.73 minutes.
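For reference, this is roughly the kind of setup I mean, as a minimal sketch rather than my exact script: torchvision's vgg16 stands in for the 18-layer VGG, random CIFAR-sized batches stand in for the real loader, and the API is the more recent one (`torch.no_grad`, `device=` keywords) rather than 0.2's Variables.

```python
import time
import torch
import torch.nn as nn
from torchvision import models

# Wrap the network in DataParallel so each batch is split across GPU 0 and GPU 1.
net = nn.DataParallel(models.vgg16(num_classes=10), device_ids=[0, 1]).cuda()
net.eval()

n_batches, batch_size = 50000 // 128, 128      # roughly one pass over the CIFAR-10 train set
start = time.time()
with torch.no_grad():                          # on 0.2 this would be volatile Variables instead
    for _ in range(n_batches):
        images = torch.randn(batch_size, 3, 224, 224, device='cuda')  # placeholder batch
        net(images)                            # DataParallel scatters, runs, and gathers
torch.cuda.synchronize()                       # wait for all kernels before stopping the clock
print('forward pass over ~50k images: %.2f min' % ((time.time() - start) / 60))
```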

When using one GPU and cuDNN it takes 0.79-0.81 minutes, so this is not a great speedup. In fact, it is better to train two networks in parallel, one per GPU, than to train one network on two GPUs and then the second one afterwards, since a single-GPU pass takes practically the same time as a two-GPU pass (see the sketch below). This is because in this kind of network almost all the time is spent in the convolutions, i.e. on the GPU, not on the CPU.
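A sketch of that "one model per GPU" alternative, again assuming the newer `.to(device)` / `.item()` API rather than 0.2; the training-loop details are placeholders.

```python
import torch
import torch.nn as nn
from torchvision import models

# Two independent copies of the model, each pinned to its own device.
net0 = models.vgg16(num_classes=10).to('cuda:0')
net1 = models.vgg16(num_classes=10).to('cuda:1')
opt0 = torch.optim.SGD(net0.parameters(), lr=0.01, momentum=0.9)
opt1 = torch.optim.SGD(net1.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def step(net, opt, images, labels, device):
    # Each step runs entirely on its own GPU; CUDA launches are asynchronous,
    # so alternating steps of the two models keeps both devices busy.
    images, labels = images.to(device), labels.to(device)
    opt.zero_grad()
    loss = criterion(net(images), labels)
    loss.backward()
    opt.step()
    return loss.item()
```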

I was wondering whether PyTorch has, or will have, what is called model parallelism. I think data parallelism is good for prediction, for example when we have 10,000 images and want to run inference over all of them. However, if we have a VGG layer with, say, 512 kernels and a 256-channel feature map as input, there are 256*512 convolutions to perform. So I think model parallelism, that is, distributing these convolutions between GPUs, would give a large performance increase. It is well explained here: https://arxiv.org/pdf/1404.5997.pdf
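To make concrete what I mean, here is a hypothetical sketch of splitting that single layer by hand: the 512 output kernels of a 3x3 convolution over a 256-channel feature map are split in half, with each half living on its own GPU. The 256/512 sizes come from the example above; the `SplitConv` module is just my illustration, not something PyTorch provides.

```python
import torch
import torch.nn as nn

class SplitConv(nn.Module):
    def __init__(self, in_channels=256, out_channels=512):
        super().__init__()
        half = out_channels // 2
        # Half of the output kernels on each GPU.
        self.conv0 = nn.Conv2d(in_channels, half, 3, padding=1).to('cuda:0')
        self.conv1 = nn.Conv2d(in_channels, half, 3, padding=1).to('cuda:1')

    def forward(self, x):
        # The two convolutions can run concurrently since CUDA kernels launch asynchronously.
        y0 = self.conv0(x.to('cuda:0'))
        y1 = self.conv1(x.to('cuda:1'))
        # Gather the halves on GPU 0 and concatenate along the channel dimension.
        return torch.cat([y0, y1.to('cuda:0')], dim=1)

out = SplitConv()(torch.randn(8, 256, 28, 28))   # -> (8, 512, 28, 28) on cuda:0
```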