Multiple replicas of the model on same GPU?

Hi, I am a newbie to PyTorch distributed.
My model is only a small component of a much more complicated problem.
I noticed that when I train it on a single GPU, it uses at most a quarter of the GPU's memory and utilization.
So I wonder whether it is possible to run four replicas of the model on the same GPU so that I can hopefully get a 4x speedup.

I read the documentation, and there are many examples of multi-GPU training, but none of them uses a fractional GPU like this. Does anyone have ideas? Thanks.

It really depends. Even if four replicas of your model fit into the memory of one GPU, they still compete for the same set of streaming multiprocessors and other shared resources on that GPU. You can check whether using multiple CUDA streams helps, e.g.:

import torch

# One CUDA stream per replica, so their kernels can be queued independently
s0 = torch.cuda.Stream()
s1 = torch.cuda.Stream()

# Launch each replica's forward pass on its own stream
with torch.cuda.stream(s0):
    output0 = model_replica0(input0)

with torch.cuda.stream(s1):
    output1 = model_replica1(input1)

# Wait for both streams to finish before using the outputs
s0.synchronize()
s1.synchronize()
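
As a side note, here is a minimal sketch of one way the two replicas above could be created on the same device; MyModel is just a stand-in for whatever module you are actually training:

import copy
import torch

device = torch.device("cuda:0")
model_replica0 = MyModel().to(device)           # stand-in for your model class
model_replica1 = copy.deepcopy(model_replica0)  # independent copy of the weights on the same GPU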

Hi Shen Li, DistributedDataParallel automatically averages the gradients when calling loss.backward(),
but I couldn't find the corresponding code in the PyTorch source. Do you know where it is?

Hey @meilu_zhu

Sorry about the delay. The grad averaging algorithm is implemented in the reducer. Each DistributedDataParallel creates its reducer instance in the constructor. More specifically, allreduce is invoked here.
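
If it helps to see the idea in isolation, below is a rough sketch of what gradient averaging amounts to, written by hand with torch.distributed. DDP's reducer does this with bucketed allreduces overlapped with the backward pass, so this is only an illustration, not the actual implementation:

import torch
import torch.distributed as dist

def average_gradients(model):
    # Sum each gradient across all processes, then divide by the world size
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

# Usage (assumes the default process group is already initialized):
# loss.backward()
# average_gradients(model)
# optimizer.step()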