Why was multi-GPU mode not faster than single-GPU for my model?

I trained a network with about 1200 MB of parameters on a dataset of about 50,000 images.
One epoch takes about 200 minutes on a single GPU, but it also takes about 200 minutes when using multiple GPUs (3 GPUs). To find out why, I checked where the time goes in the multi-GPU code path:

time_verbose = False
start_time = time.time()
batch.scatter()                      # distribute the input batch onto the GPU(s)
scatter_time = time.time()
if self.num_gpus == 1:
    outputs = self(*batch[0])        # ordinary single-GPU forward pass
    output_time = time.time()
    if time_verbose:
        print('scatter {}, output {}'.format(
            scatter_time - start_time,
            output_time - scatter_time,
        ))
else:
    # copy the model onto every GPU, run the forward pass on each replica,
    # then collect the per-replica outputs back onto one device
    replicas = nn.parallel.replicate(self, devices=list(range(self.num_gpus)))
    replica_time = time.time()
    outputs = nn.parallel.parallel_apply(replicas, [batch[i] for i in range(self.num_gpus)])
    output_time = time.time()
    if self.training:
        outputs = gather_res(outputs, 0, dim=0)
    gather_time = time.time()
    if time_verbose:
        print('scatter {}, replicate {}, output {}, gather {}'.format(
            scatter_time - start_time,
            replica_time - scatter_time,
            output_time - replica_time,
            gather_time - output_time
        ))
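
(One caveat about these measurements: CUDA kernels are launched asynchronously, so a plain time.time() right after a call may only measure the launch, not when the GPU actually finishes. If needed, I could re-measure with explicit synchronization, roughly like the helper below — the timed name is just an illustration, not part of my code.)

import time
import torch

def timed(fn, *args, **kwargs):
    # wait for all pending GPU work before and after the call, so the
    # measured interval covers kernel execution, not just the launch
    torch.cuda.synchronize()
    start = time.time()
    result = fn(*args, **kwargs)
    torch.cuda.synchronize()
    return result, time.time() - start

For example, something like outputs, forward_time = timed(self, *batch[0]) in the single-GPU branch.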

When using 3 GPUs, the output time is about 0.27 seconds:

scatter 0.010387897491455078, replicate 0.008687257766723633, output 0.272158145904541, gather 0.002345561981201172

When using one GPU, the output time is about 0.075 seconds:

scatter 0.03184366226196289, output 0.07560396194458008

The 3-GPU forward pass takes about 0.27 s versus about 0.075 s on a single GPU, which looks as if the three replicas run one after another rather than in parallel. In other words, the parallel training may not have worked at all. Does anybody know why?
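
For comparison, this is roughly what I would expect the built-in nn.DataParallel wrapper to look like for the same kind of setup (the tiny model and tensor sizes below are placeholders, not my real network):

import torch
import torch.nn as nn

# stand-in model; the real network is much larger
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()
model = nn.DataParallel(model, device_ids=[0, 1, 2])  # scatter/replicate/gather handled internally

inputs = torch.randn(96, 512).cuda()  # split into 3 chunks of 32, one per GPU
outputs = model(inputs)               # forward pass runs on all 3 GPUs
print(outputs.shape)                  # torch.Size([96, 10]), gathered back onto GPU 0

If this wrapper showed the same behaviour (per-step time growing roughly with the number of GPUs), the problem would presumably not be in my manual replicate/parallel_apply code.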