I trained a network with about 1,200 MB of parameters on a dataset of about 50,000 images.
Training one epoch took about 200 minutes on a single GPU, and it also took about 200 minutes per epoch when using multiple GPUs (3 GPUs). To find out why, I checked the time spent in each step of the multi-GPU code:
# Inside the model's forward pass; `gather_res` is a project-specific helper.
time_verbose = False
start_time = time.time()
# Split the batch into one chunk per GPU.
batch.scatter()
scatter_time = time.time()

if self.num_gpus == 1:
    # Single-GPU path: run the forward pass on the only chunk.
    outputs = self(*batch[0])
    output_time = time.time()
    if time_verbose:
        print('scatter {}, output {}'.format(
            scatter_time - start_time,
            output_time - scatter_time,
        ))
else:
    # Multi-GPU path: copy the model onto every GPU, run one chunk on each
    # replica in parallel, then gather the results back onto GPU 0.
    replicas = nn.parallel.replicate(self, devices=list(range(self.num_gpus)))
    replica_time = time.time()
    outputs = nn.parallel.parallel_apply(replicas, [batch[i] for i in range(self.num_gpus)])
    output_time = time.time()
    if self.training:
        outputs = gather_res(outputs, 0, dim=0)
    gather_time = time.time()
    if time_verbose:
        print('scatter {}, replicate {}, output {}, gather {}'.format(
            scatter_time - start_time,
            replica_time - scatter_time,
            output_time - replica_time,
            gather_time - output_time
        ))
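For reference, this manual scatter/replicate/parallel_apply/gather sequence is essentially what torch.nn.DataParallel does internally on every forward call. Below is a minimal sketch using a tiny stand-in model; the model, batch size, and input shape are made-up placeholders, not the actual network from my code.

import torch
import torch.nn as nn

# Tiny stand-in model; the real network in the question is far larger.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
).cuda()

# DataParallel replicates the model onto the listed GPUs, splits each input
# batch across them, and gathers the outputs back onto device 0.
model = nn.DataParallel(model, device_ids=[0, 1, 2])

inputs = torch.randn(24, 3, 224, 224, device='cuda:0')  # one batch, split 8 per GPU
outputs = model(inputs)  # scatter -> replicate -> parallel_apply -> gather
print(outputs.shape)     # torch.Size([24, 10])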
When using 3 GPUs, the output (forward) time is about 0.27 seconds:
scatter 0.010387897491455078, replicate 0.008687257766723633, output 0.272158145904541, gather 0.002345561981201172
When using one GPU, the output time is about 0.076 seconds:
scatter 0.03184366226196289, output 0.07560396194458008
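A rough back-of-the-envelope check of those numbers (assuming each of the 3 GPUs receives a sub-batch of the same size as the single-GPU batch, which the identical epoch times suggest):

# Forward times copied from the logs above.
t_multi = 0.272      # 3 sub-batches on 3 GPUs in parallel
t_single = 0.0756    # 1 sub-batch on 1 GPU

# Time a single GPU would need for the same 3 sub-batches run sequentially:
t_single_equiv = 3 * t_single        # ~0.227 s
speedup = t_single_equiv / t_multi   # ~0.83x, i.e. the parallel step is not faster
print('speedup: {:.2f}x'.format(speedup))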
So the forward pass on 3 GPUs is no faster than running the same sub-batches one by one on a single GPU, which matches the identical epoch times. It looks like parallel training is not actually working at all. Does anybody know why?