Hey @blade, sorry about the delay. I submitted a PR to your repo that contains the debug code I added to get the following outputs. Several observations/comments:
The original code measures the latency of the entire train() method. It is more informative to look at the latency of an individual forward pass, since data preparation and other steps also contribute to the total training time, and we would like to exclude those.
In the outputs below, the Forward latency measures the time to execute model(data), using CUDA events to wrap the model(data) invocation. The Inner Forward latency measures the time spent inside the model.forward() method. For non-DP models, the two should be roughly the same. For DP models, the Inner Forward latency should be smaller, as it does not include the model replication, input scattering, and output gathering overhead.
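The instrumentation can be sketched roughly like this (a minimal sketch of the measurement described above, not the exact code in the PR; `timed_forward` is an illustrative name):

```python
import torch

def timed_forward(model, data):
    """Time one forward pass with CUDA events (illustrative sketch)."""
    e_start = torch.cuda.Event(enable_timing=True)
    e_finish = torch.cuda.Event(enable_timing=True)
    e_start.record()
    output = model(data)        # the span being measured
    e_finish.record()
    e_finish.synchronize()      # wait until the GPU reaches e_finish
    return output, e_start.elapsed_time(e_finish)  # milliseconds
```

The same pattern inside model.forward() yields the Inner Forward latency; under DataParallel, each replica records its own pair of events on its own device.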
From the outputs we can see that the non-DP model's inner forward pass takes about 11-12 ms, while the DP model's inner forward pass takes about 13-15 ms, which looks reasonable, as the additional delay might be caused by GIL contention.
The DP model's outer forward pass takes around 17-18 ms, which suggests that the replication + scattering + gathering adds another 2-5 ms.
If this additional overhead from DP is not acceptable, you might want to try DistributedDataParallel.
device_count() 2
get_device_name Quadro GP100
get_device_name Quadro GP100
===== Warn Up =====
Inner Forward latency at cuda:0 is: 1505.4864501953125
Forward latency is: 1505.63232421875
[W TensorIterator.cpp:917] Warning: Mixed memory format inputs detected while calling the operator. The operator will output contiguous tensor even if some of the inputs are in channels_last format. (function operator())
[W TensorIterator.cpp:923] Warning: Mixed memory format inputs detected while calling the operator. The operator will output channels_last tensor even if some of the inputs are not in channels_last format. (function operator())
Inner Forward latency at cuda:0 is: 55.495487213134766
Forward latency is: 55.64147186279297
Inner Forward latency at cuda:0 is: 35.402400970458984
Forward latency is: 35.53887939453125
Inner Forward latency at cuda:0 is: 9.02233600616455
Forward latency is: 9.138496398925781
Inner Forward latency at cuda:0 is: 25.25040054321289
Forward latency is: 25.4737606048584
Inner Forward latency at cuda:0 is: 11.507648468017578
Forward latency is: 11.665023803710938
Inner Forward latency at cuda:0 is: 10.732000350952148
Forward latency is: 10.859199523925781
Train Epoch: 1 Loss: 8.790261
===== Single GPU =====
Inner Forward latency at cuda:0 is: 11.000991821289062
Forward latency is: 11.108160018920898
Inner Forward latency at cuda:0 is: 11.089119911193848
Forward latency is: 11.22646427154541
Inner Forward latency at cuda:0 is: 11.005184173583984
Forward latency is: 11.144991874694824
Inner Forward latency at cuda:0 is: 11.082048416137695
Forward latency is: 11.23363208770752
Inner Forward latency at cuda:0 is: 12.89516830444336
Forward latency is: 13.033023834228516
Inner Forward latency at cuda:0 is: 11.095647811889648
Forward latency is: 11.230015754699707
Inner Forward latency at cuda:0 is: 10.849760055541992
Forward latency is: 10.985152244567871
Train Epoch: 1 Loss: 8.162259
===== DataParallel =====
Inner Forward latency at cuda:0 is: 32.145503997802734
Inner Forward latency at cuda:1 is: 1494.096435546875
Forward latency is: 3853.993896484375
Inner Forward latency at cuda:0 is: 13.01587200164795
Inner Forward latency at cuda:1 is: 15.880448341369629
Forward latency is: 18.9583683013916
Inner Forward latency at cuda:1 is: 13.992992401123047
Inner Forward latency at cuda:0 is: 14.519583702087402
Forward latency is: 17.375328063964844
Inner Forward latency at cuda:0 is: 14.469663619995117
Inner Forward latency at cuda:1 is: 14.402463912963867
Forward latency is: 17.670656204223633
Inner Forward latency at cuda:0 is: 13.701855659484863
Inner Forward latency at cuda:1 is: 14.609951972961426
Forward latency is: 17.782848358154297
Inner Forward latency at cuda:0 is: 14.08131217956543
Inner Forward latency at cuda:1 is: 14.001919746398926
Forward latency is: 17.2807674407959
Inner Forward latency at cuda:1 is: 13.710176467895508
Inner Forward latency at cuda:0 is: 14.413311958312988
Forward latency is: 17.411136627197266
Train Epoch: 1 Loss: 7.542133
Thank you very much @mrshenli for the thorough explanation. When running your modified code, I get the following error:
Traceback (most recent call last):
  File "main.py", line 139, in <module>
    main()
  File "main.py", line 124, in main
    model.train()
  File "/afs/blade/Project_B/VAEtrain.py", line 137, in train
    self.trainepoch(epoch)
  File "/afs/blade/Project_B/VAEtrain.py", line 116, in trainepoch
    print('Forward latency is: {} '.format(e_start.elapsed_time(e_finish)))
  File "/afs/blade/.conda/envs/My_env/lib/python3.6/site-packages/torch/cuda/streams.py", line 166, in elapsed_time
    return super(Event, self).elapsed_time(end_event)
RuntimeError: CUDA error: device not ready
which I fixed by adding torch.cuda.synchronize() before each print statement (last two commits on GitHub). I'm not sure why it worked when you ran it but not when I did. With 600 training samples and a batch size of 32, I get the following results:
device_count() 2
get_device_name Quadro RTX 6000
get_device_name Quadro RTX 6000
===== Warn Up =====
Inner Forward latency at cuda:0 is: 312.0291748046875
Forward latency is: 312.20123291015625
Inner Forward latency at cuda:0 is: 7.032832145690918
Forward latency is: 7.154176235198975
Inner Forward latency at cuda:0 is: 6.665823936462402
Forward latency is: 6.760447978973389
Inner Forward latency at cuda:0 is: 6.610943794250488
Forward latency is: 6.698976039886475
Inner Forward latency at cuda:0 is: 6.579264163970947
Forward latency is: 6.668288230895996
Inner Forward latency at cuda:0 is: 6.5924482345581055
Forward latency is: 6.685696125030518
Inner Forward latency at cuda:0 is: 6.5490241050720215
Forward latency is: 6.63756799697876
Inner Forward latency at cuda:0 is: 6.6170878410339355
Forward latency is: 6.7051520347595215
Inner Forward latency at cuda:0 is: 6.561567783355713
Forward latency is: 6.649856090545654
Inner Forward latency at cuda:0 is: 6.608831882476807
Forward latency is: 6.696991920471191
Inner Forward latency at cuda:0 is: 6.565536022186279
Forward latency is: 6.652768135070801
Inner Forward latency at cuda:0 is: 7.073791980743408
Forward latency is: 7.168000221252441
Inner Forward latency at cuda:0 is: 6.677855968475342
Forward latency is: 6.764575958251953
Inner Forward latency at cuda:0 is: 6.579904079437256
Forward latency is: 6.668064117431641
Inner Forward latency at cuda:0 is: 6.5595197677612305
Forward latency is: 6.647808074951172
Inner Forward latency at cuda:0 is: 7.110752105712891
Forward latency is: 7.204095840454102
Inner Forward latency at cuda:0 is: 6.617055892944336
Forward latency is: 6.7060160636901855
Inner Forward latency at cuda:0 is: 6.559423923492432
Forward latency is: 6.6473917961120605
Inner Forward latency at cuda:0 is: 6.67033576965332
Forward latency is: 6.759903907775879
Train Epoch: 1 Loss: 3.500732
===== Single GPU =====
Inner Forward latency at cuda:0 is: 11.719776153564453
Forward latency is: 11.848064422607422
Inner Forward latency at cuda:0 is: 6.635519981384277
Forward latency is: 6.722879886627197
Inner Forward latency at cuda:0 is: 6.606272220611572
Forward latency is: 6.693664073944092
Inner Forward latency at cuda:0 is: 6.573984146118164
Forward latency is: 6.675968170166016
Inner Forward latency at cuda:0 is: 6.516831874847412
Forward latency is: 6.606847763061523
Inner Forward latency at cuda:0 is: 6.608352184295654
Forward latency is: 6.696959972381592
Inner Forward latency at cuda:0 is: 6.593344211578369
Forward latency is: 6.689760208129883
Inner Forward latency at cuda:0 is: 38.63993453979492
Forward latency is: 38.72832107543945
Inner Forward latency at cuda:0 is: 6.6211838722229
Forward latency is: 6.716095924377441
Inner Forward latency at cuda:0 is: 6.551616191864014
Forward latency is: 6.639872074127197
Inner Forward latency at cuda:0 is: 7.57587194442749
Forward latency is: 7.692992210388184
Inner Forward latency at cuda:0 is: 6.579360008239746
Forward latency is: 6.665279865264893
Inner Forward latency at cuda:0 is: 58.9370231628418
Forward latency is: 59.02336120605469
Inner Forward latency at cuda:0 is: 6.718719959259033
Forward latency is: 6.813632011413574
Inner Forward latency at cuda:0 is: 6.606624126434326
Forward latency is: 6.692448139190674
Inner Forward latency at cuda:0 is: 6.61740779876709
Forward latency is: 6.706719875335693
Inner Forward latency at cuda:0 is: 6.545951843261719
Forward latency is: 6.634687900543213
Inner Forward latency at cuda:0 is: 6.5739521980285645
Forward latency is: 6.660799980163574
Inner Forward latency at cuda:0 is: 6.364831924438477
Forward latency is: 6.4543681144714355
Train Epoch: 1 Loss: 2.285649
===== DataParallel =====
Inner Forward latency at cuda:0 is: 42.26464080810547
Inner Forward latency at cuda:1 is: 393.2672119140625
Forward latency is: 4947.4130859375
Inner Forward latency at cuda:0 is: 17.808063507080078
Inner Forward latency at cuda:1 is: 22.410879135131836
Forward latency is: 26.27788734436035
Inner Forward latency at cuda:1 is: 20.338655471801758
Inner Forward latency at cuda:0 is: 22.937599182128906
Forward latency is: 26.417760848999023
Inner Forward latency at cuda:1 is: 18.742080688476562
Inner Forward latency at cuda:0 is: 21.928159713745117
Forward latency is: 25.05731201171875
Inner Forward latency at cuda:1 is: 13.354111671447754
Inner Forward latency at cuda:0 is: 16.105663299560547
Forward latency is: 19.50511932373047
Inner Forward latency at cuda:0 is: 21.91974449157715
Inner Forward latency at cuda:1 is: 23.952512741088867
Forward latency is: 27.585535049438477
Inner Forward latency at cuda:0 is: 22.37603187561035
Inner Forward latency at cuda:1 is: 22.983200073242188
Forward latency is: 26.94927978515625
Inner Forward latency at cuda:1 is: 18.544095993041992
Inner Forward latency at cuda:0 is: 20.25881576538086
Forward latency is: 23.73529624938965
Inner Forward latency at cuda:0 is: 20.680864334106445
Inner Forward latency at cuda:1 is: 22.002399444580078
Forward latency is: 26.0317440032959
Inner Forward latency at cuda:0 is: 19.656959533691406
Inner Forward latency at cuda:1 is: 21.027456283569336
Forward latency is: 24.815616607666016
Inner Forward latency at cuda:0 is: 22.988128662109375
Inner Forward latency at cuda:1 is: 23.180959701538086
Forward latency is: 27.598848342895508
Inner Forward latency at cuda:1 is: 21.547008514404297
Inner Forward latency at cuda:0 is: 22.051231384277344
Forward latency is: 25.91289520263672
Inner Forward latency at cuda:1 is: 19.94380760192871
Inner Forward latency at cuda:0 is: 22.607072830200195
Forward latency is: 26.406591415405273
Inner Forward latency at cuda:0 is: 22.456287384033203
Inner Forward latency at cuda:1 is: 23.465984344482422
Forward latency is: 27.797727584838867
Inner Forward latency at cuda:1 is: 21.703231811523438
Inner Forward latency at cuda:0 is: 22.716032028198242
Forward latency is: 27.013696670532227
Inner Forward latency at cuda:0 is: 23.64579200744629
Inner Forward latency at cuda:1 is: 24.85055923461914
Forward latency is: 29.45417594909668
Inner Forward latency at cuda:1 is: 24.470399856567383
Inner Forward latency at cuda:0 is: 26.32054328918457
Forward latency is: 30.444063186645508
Inner Forward latency at cuda:0 is: 21.156511306762695
Inner Forward latency at cuda:1 is: 22.762048721313477
Forward latency is: 27.273216247558594
Inner Forward latency at cuda:0 is: 25.427295684814453
Inner Forward latency at cuda:1 is: 25.92563247680664
Forward latency is: 30.507776260375977
Train Epoch: 1 Loss: 1.831364
As you can see, when I run the non-DP model, the inner forward pass takes about 6-7 ms, while it takes the DP model 21-23 ms. This seems like a much larger gap. Does this look alright? I'm not sure whether synchronizing before the print statement causes this, but I was not able to run the code without that change. May I ask how many training samples and what batch size you are using, so that we compare the same results?
Good catch. I added e_finish.synchronize() before calling elapsed_time to fix the problem. The fact that it worked with the prior code might have been due to lucky timing, i.e., the kernels happened to be done by the time the event was queried. I still see similar results (outputs pasted at the end of this post).
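In other words, the fix boils down to blocking on the end event before reading it (a sketch; `safe_elapsed_ms` is an illustrative helper, not code from the repo):

```python
import torch

def safe_elapsed_ms(e_start, e_finish):
    # elapsed_time() raises "CUDA error: device not ready" if the end
    # event has not fired yet. e_finish.synchronize() blocks only until
    # this particular event completes, which is cheaper than a full
    # torch.cuda.synchronize() of the whole device.
    e_finish.synchronize()
    return e_start.elapsed_time(e_finish)
```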
As you can see, when I run the non-DP model, the inner forward pass takes about 6-7 ms, while it takes the DP model 21-23 ms. This seems like a much larger gap. Does this look alright?
This does not look correct to me, but I cannot explain it. The DP threads shouldn't take 4X longer for every forward pass. BTW, what is your GPU type and what is the CUDA version?
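If it helps, those environment details can be printed with standard PyTorch calls:

```python
import torch

# Report PyTorch/CUDA versions and visible GPUs.
print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)   # None for CPU-only builds
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print("cuda:{}".format(i), torch.cuda.get_device_name(i))
```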
May I ask how many training samples and what batch size you are using, so that we compare the same results?
I just ran the code as-is using python main.py. I copied the .data file to the data folder, since the program seems to expect to read it from there.
===== Warn Up =====
Inner Forward latency at cuda:0 is: 1488.1781005859375
Forward latency is: 1488.338134765625
[W TensorIterator.cpp:917] Warning: Mixed memory format inputs detected while calling the operator. The operator will output contiguous tensor even if some of the inputs are in channels_last format. (function operator())
[W TensorIterator.cpp:923] Warning: Mixed memory format inputs detected while calling the operator. The operator will output channels_last tensor even if some of the inputs are not in channels_last format. (function operator())
Inner Forward latency at cuda:0 is: 58.0959358215332
Forward latency is: 58.23100662231445
Inner Forward latency at cuda:0 is: 8.668992042541504
Forward latency is: 8.790816307067871
Inner Forward latency at cuda:0 is: 8.797216415405273
Forward latency is: 8.929183959960938
Inner Forward latency at cuda:0 is: 18.98464012145996
Forward latency is: 19.132671356201172
Inner Forward latency at cuda:0 is: 34.01536178588867
Forward latency is: 34.20316696166992
Inner Forward latency at cuda:0 is: 8.927007675170898
Forward latency is: 9.08579158782959
Train Epoch: 1 Loss: 8.790261
===== Single GPU =====
Inner Forward latency at cuda:0 is: 8.95967960357666
Forward latency is: 9.068127632141113
Inner Forward latency at cuda:0 is: 8.704704284667969
Forward latency is: 8.856863975524902
Inner Forward latency at cuda:0 is: 8.709856033325195
Forward latency is: 8.831232070922852
Inner Forward latency at cuda:0 is: 8.72275161743164
Forward latency is: 8.845024108886719
Inner Forward latency at cuda:0 is: 18.166719436645508
Forward latency is: 18.28998374938965
Inner Forward latency at cuda:0 is: 8.70742416381836
Forward latency is: 8.828767776489258
Inner Forward latency at cuda:0 is: 8.335904121398926
Forward latency is: 8.45967960357666
Train Epoch: 1 Loss: 8.162259
===== DataParallel Warn Up =====
Inner Forward latency at cuda:0 is: 29.18454360961914
Inner Forward latency at cuda:1 is: 1496.7349853515625
Forward latency is: 3831.19189453125
Inner Forward latency at cuda:1 is: 15.559136390686035
Inner Forward latency at cuda:0 is: 15.93996810913086
Forward latency is: 19.217119216918945
Inner Forward latency at cuda:0 is: 13.919391632080078
Inner Forward latency at cuda:1 is: 14.139103889465332
Forward latency is: 17.232608795166016
Inner Forward latency at cuda:1 is: 14.457535743713379
Inner Forward latency at cuda:0 is: 14.94876766204834
Forward latency is: 17.987136840820312
Inner Forward latency at cuda:1 is: 13.798944473266602
Inner Forward latency at cuda:0 is: 14.654208183288574
Forward latency is: 17.719648361206055
Inner Forward latency at cuda:1 is: 13.641887664794922
Inner Forward latency at cuda:0 is: 14.081600189208984
Forward latency is: 17.102848052978516
Inner Forward latency at cuda:0 is: 13.795040130615234
Inner Forward latency at cuda:1 is: 13.760607719421387
Forward latency is: 18.731584548950195
Train Epoch: 1 Loss: 7.542133
===== DataParallel =====
Inner Forward latency at cuda:0 is: 13.752639770507812
Inner Forward latency at cuda:1 is: 14.990143775939941
Forward latency is: 18.172191619873047
Inner Forward latency at cuda:0 is: 13.295552253723145
Inner Forward latency at cuda:1 is: 13.322591781616211
Forward latency is: 16.823423385620117
Inner Forward latency at cuda:0 is: 13.860544204711914
Inner Forward latency at cuda:1 is: 14.084735870361328
Forward latency is: 17.313791275024414
Inner Forward latency at cuda:0 is: 13.091039657592773
Inner Forward latency at cuda:1 is: 12.956319808959961
Forward latency is: 16.153152465820312
Inner Forward latency at cuda:1 is: 13.988863945007324
Inner Forward latency at cuda:0 is: 14.459775924682617
Forward latency is: 17.44748878479004
Inner Forward latency at cuda:0 is: 13.118144035339355
Inner Forward latency at cuda:1 is: 13.629311561584473
Forward latency is: 16.748767852783203
Inner Forward latency at cuda:1 is: 13.413951873779297
Inner Forward latency at cuda:0 is: 14.655072212219238
Forward latency is: 17.669599533081055
Train Epoch: 1 Loss: 6.871040
I'm using RTX 6000 GPU cards with 24 GB of memory. The CUDA version is 10.0. I'm re-running with e_finish.synchronize() and the default batch/data size to be consistent with you.
device_count() 2
get_device_name Quadro RTX 6000
get_device_name Quadro RTX 6000
===== Warn Up =====
Inner Forward latency at cuda:0 is: 269.352783203125
Forward latency is: 269.5054931640625
Inner Forward latency at cuda:0 is: 6.518432140350342
Forward latency is: 6.623968124389648
Inner Forward latency at cuda:0 is: 6.06825590133667
Forward latency is: 6.1439361572265625
Inner Forward latency at cuda:0 is: 6.148096084594727
Forward latency is: 6.224703788757324
Inner Forward latency at cuda:0 is: 6.141952037811279
Forward latency is: 6.215456008911133
Inner Forward latency at cuda:0 is: 6.092895984649658
Forward latency is: 6.170144081115723
Inner Forward latency at cuda:0 is: 5.781504154205322
Forward latency is: 5.854623794555664
Train Epoch: 1 Loss: 8.880923
===== Single GPU =====
Inner Forward latency at cuda:0 is: 6.268703937530518
Forward latency is: 6.322175979614258
Inner Forward latency at cuda:0 is: 6.090367794036865
Forward latency is: 6.166687965393066
Inner Forward latency at cuda:0 is: 6.139904022216797
Forward latency is: 6.215072154998779
Inner Forward latency at cuda:0 is: 6.092895984649658
Forward latency is: 6.168735980987549
Inner Forward latency at cuda:0 is: 6.106112003326416
Forward latency is: 6.18614387512207
Inner Forward latency at cuda:0 is: 6.128384113311768
Forward latency is: 6.200736045837402
Inner Forward latency at cuda:0 is: 5.888192176818848
Forward latency is: 5.9773759841918945
Train Epoch: 1 Loss: 7.826999
===== DataParallel =====
Inner Forward latency at cuda:0 is: 29.677568435668945
Inner Forward latency at cuda:1 is: 247.988037109375
Forward latency is: 5157.865234375
Inner Forward latency at cuda:0 is: 20.65011215209961
Inner Forward latency at cuda:1 is: 22.26380729675293
Forward latency is: 26.309247970581055
Inner Forward latency at cuda:0 is: 18.71731185913086
Inner Forward latency at cuda:1 is: 20.099807739257812
Forward latency is: 25.069536209106445
Inner Forward latency at cuda:1 is: 18.127199172973633
Inner Forward latency at cuda:0 is: 21.228160858154297
Forward latency is: 27.178943634033203
Inner Forward latency at cuda:0 is: 22.07948875427246
Inner Forward latency at cuda:1 is: 21.940223693847656
Forward latency is: 28.237247467041016
Inner Forward latency at cuda:0 is: 22.045312881469727
Inner Forward latency at cuda:1 is: 22.9520320892334
Forward latency is: 28.17840003967285
Inner Forward latency at cuda:1 is: 19.71459197998047
Inner Forward latency at cuda:0 is: 20.302751541137695
Forward latency is: 26.347999572753906
Train Epoch: 1 Loss: 6.633629
I also have access to V100 GPU cards with 32 GB of memory. The output with those is the following:
device_count() 2
get_device_name Tesla V100-PCIE-32GB
get_device_name Tesla V100-PCIE-32GB
===== Warn Up =====
Inner Forward latency at cuda:0 is: 330.8124084472656
Forward latency is: 330.9803771972656
Inner Forward latency at cuda:0 is: 6.412288188934326
Forward latency is: 6.533120155334473
Inner Forward latency at cuda:0 is: 6.1184000968933105
Forward latency is: 6.200319766998291
Inner Forward latency at cuda:0 is: 6.159359931945801
Forward latency is: 6.237184047698975
Inner Forward latency at cuda:0 is: 18.338815689086914
Forward latency is: 18.41868782043457
Inner Forward latency at cuda:0 is: 6.0641279220581055
Forward latency is: 6.145023822784424
Inner Forward latency at cuda:0 is: 5.8081278800964355
Forward latency is: 5.885951995849609
Train Epoch: 1 Loss: 8.888597
===== Single GPU =====
Inner Forward latency at cuda:0 is: 6.325247764587402
Forward latency is: 6.380544185638428
Inner Forward latency at cuda:0 is: 6.120448112487793
Forward latency is: 6.201344013214111
Inner Forward latency at cuda:0 is: 6.12659215927124
Forward latency is: 6.204415798187256
Inner Forward latency at cuda:0 is: 6.134751796722412
Forward latency is: 6.215680122375488
Inner Forward latency at cuda:0 is: 6.090752124786377
Forward latency is: 6.168575763702393
Inner Forward latency at cuda:0 is: 6.11737585067749
Forward latency is: 6.193151950836182
Inner Forward latency at cuda:0 is: 66.82112121582031
Forward latency is: 66.90201568603516
Train Epoch: 1 Loss: 7.709968
===== DataParallel =====
Inner Forward latency at cuda:1 is: 268.62591552734375
Inner Forward latency at cuda:0 is: 275.9311218261719
Forward latency is: 5106.107421875
Inner Forward latency at cuda:0 is: 18.130943298339844
Inner Forward latency at cuda:1 is: 20.929536819458008
Forward latency is: 26.099712371826172
Inner Forward latency at cuda:1 is: 19.733503341674805
Inner Forward latency at cuda:0 is: 20.28544044494629
Forward latency is: 25.09929656982422
Inner Forward latency at cuda:1 is: 20.525056838989258
Inner Forward latency at cuda:0 is: 22.49932861328125
Forward latency is: 26.862592697143555
Inner Forward latency at cuda:0 is: 21.86342430114746
Inner Forward latency at cuda:1 is: 22.064159393310547
Forward latency is: 26.846208572387695
Inner Forward latency at cuda:0 is: 19.932159423828125
Inner Forward latency at cuda:1 is: 20.315135955810547
Forward latency is: 25.975807189941406
Inner Forward latency at cuda:0 is: 19.63417625427246
Inner Forward latency at cuda:1 is: 20.179967880249023
Forward latency is: 25.976831436157227
Train Epoch: 1 Loss: 6.956287
I talked with @ngimel offline, but nothing rings a bell. Sorry, since we cannot reproduce this locally, we can't help debug it further. If the slowdown is a concern, it might be worth trying DistributedDataParallel.
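For completeness, a minimal single-node DistributedDataParallel sketch (one process per GPU, so there is no per-iteration replication and no GIL contention between replicas; the `Linear` model and shapes below are stand-ins, not your actual VAE):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # Rendezvous settings for a single-node run (illustrative values).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = torch.nn.Linear(10, 10).cuda(rank)   # stand-in for the real model
    ddp_model = DDP(model, device_ids=[rank])
    out = ddp_model(torch.randn(32, 10).cuda(rank))
    out.sum().backward()   # gradients are all-reduced across processes
    dist.destroy_process_group()

if __name__ == "__main__" and torch.cuda.device_count() >= 2:
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

Each process drives its own GPU, so the forward pass no longer pays the replicate/scatter/gather cost on every iteration.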