Multiple replicas of the model on the same GPU?

Hi, I am a newbie to PyTorch distributed.
My model is only a small component of a much more complicated problem.
I noticed that when I train it on a single GPU, it uses at most a quarter of the GPU's memory and utilization.
So I wonder whether it is possible to run four replicas of the model on the same GPU, so that I can hopefully get a 4x speedup.
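To make the question concrete, this is roughly what I have in mind (a minimal sketch; MyModel and the dummy data are just placeholders for my real training loop):

```python
import torch
import torch.multiprocessing as mp
import torch.nn as nn

class MyModel(nn.Module):          # stand-in for my small model
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(128, 10)

    def forward(self, x):
        return self.net(x)

def worker(rank: int):
    device = torch.device("cuda:0")        # all replicas share the one GPU
    model = MyModel().to(device)
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    for _ in range(100):                   # dummy training loop
        x = torch.randn(64, 128, device=device)
        y = torch.randint(0, 10, (64,), device=device)
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

if __name__ == "__main__":
    mp.spawn(worker, nprocs=4)             # four independent replicas, same GPU
```

I am not sure whether spawning four processes on one device like this actually gives a speedup in practice, or whether there is a better-supported way to do it.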

I read the documentation and there are many multi-GPU examples, but none of them uses a fractional GPU like this. Does anyone have ideas? Thanks.