Distributed training is slow

I am training my model on a single GPU (V100), and the speed is as follows:

2019-09-17 01:46:48,876 - INFO - [ 1022/10000]	lr: 0.000100	Time  1.773 ( 1.744)	Data  0.001 ( 0.001)	Loss 6341.771 (6945.944)
2019-09-17 01:46:50,593 - INFO - [ 1023/10000]	lr: 0.000100	Time  1.607 ( 1.722)	Data  0.001 ( 0.001)	Loss 7225.229 (6958.357)
2019-09-17 01:46:52,323 - INFO - [ 1024/10000]	lr: 0.000100	Time  1.717 ( 1.732)	Data  0.001 ( 0.001)	Loss 7218.038 (6929.233)

Regarding the time format, e.g. Time 1.717 ( 1.732): 1.717 is the time for the current batch, and 1.732 is the average time over the most recent one hundred batches.

When I use 8 GPUs on one node with torch.nn.parallel.DistributedDataParallel, torch.nn.SyncBatchNorm.convert_sync_batchnorm, and mp.spawn, the speed is as follows:

2019-09-16 06:06:40,619 - INFO - [   9/5000]    lr: 0.000036    Time  2.822 ( 4.896)    Data  0.001 ( 1.428)    Loss 307113.969 (331260.794)
2019-09-16 06:06:43,485 - INFO - [  10/5000]    lr: 0.000037    Time  3.419 ( 4.749)    Data  0.001 ( 0.001)    Loss 303037.688 (325792.062)
2019-09-16 06:06:46,120 - INFO - [  11/5000]    lr: 0.000037    Time  2.866 ( 2.943)    Data  0.001 ( 0.001)    Loss 296579.000 (320417.425)
2019-09-16 06:06:48,925 - INFO - [  12/5000]    lr: 0.000037    Time  2.634 ( 2.879)    Data  0.001 ( 0.001)    Loss 292080.625 (315081.881)
2019-09-16 06:06:51,671 - INFO - [  13/5000]    lr: 0.000038    Time  2.806 ( 2.847)    Data  0.001 ( 0.001)    Loss 286678.000 (309843.294)

The per-iteration time goes from about 1.73 s on a single GPU to about 2.85 s on 8 GPUs, so the scaling efficiency is only about 0.6 (8 × 1.73 / 2.85 ≈ 4.9× throughput instead of 8×). Why is the per-iteration time so much higher under DistributedDataParallel, and how can I improve it?
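
For reference, here is a minimal, self-contained sketch of the kind of setup I am using. The toy model, toy dataset, batch size, learning rate, and the 127.0.0.1:23456 init address are placeholders for illustration, not my actual training code:

    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler


    def main_worker(rank, world_size):
        # One process per GPU; NCCL backend for single-node multi-GPU training.
        dist.init_process_group(
            backend="nccl",
            init_method="tcp://127.0.0.1:23456",  # placeholder address/port
            world_size=world_size,
            rank=rank,
        )
        torch.cuda.set_device(rank)

        # Toy model standing in for the real network.
        model = nn.Sequential(nn.Linear(128, 256), nn.BatchNorm1d(256),
                              nn.ReLU(), nn.Linear(256, 1)).cuda(rank)
        # Convert BatchNorm layers to SyncBatchNorm before wrapping in DDP.
        model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
        model = DDP(model, device_ids=[rank])

        # Toy dataset; DistributedSampler gives each rank a distinct shard.
        dataset = TensorDataset(torch.randn(1024, 128), torch.randn(1024, 1))
        sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
        loader = DataLoader(dataset, batch_size=16, sampler=sampler,
                            num_workers=4, pin_memory=True)

        criterion = nn.MSELoss()
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
        for x, y in loader:
            x = x.cuda(rank, non_blocking=True)
            y = y.cuda(rank, non_blocking=True)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()   # gradients are all-reduced across the 8 processes here
            optimizer.step()

        dist.destroy_process_group()


    if __name__ == "__main__":
        world_size = 8
        mp.spawn(main_worker, nprocs=world_size, args=(world_size,))

My real script follows this same structure: one spawned process per GPU, SyncBatchNorm conversion before the DDP wrap, and a DistributedSampler so each rank sees a different shard of the data.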