Issue with using DataParallel (includes minimal code)

Hey @blade, sorry about the delay. I submitted a PR to your repo containing the debug code I added to get the outputs below. A few observations/comments:

  1. The original code measures the latency of the entire train() method. It is more informative to look at the latency of an individual forward pass, since data preparation etc. also contributes to the total training time and we would like to exclude that.
  2. In the outputs below, the Forward latency measures the time to execute model(data), using CUDA events to wrap the model(data) invocation, while the Inner Forward latency measures the time spent inside the model.forward() method (see the timing sketch after this list). For non-DP models, the two latencies should be the same. For DP models, the Inner Forward latency should be smaller, as it does not include the model replication, input scattering, and output gathering overhead.
  3. From the outputs we can see that the non-DP model's inner forward pass takes about 11-12 ms, while the DP model's inner forward pass takes about 13-15 ms. This looks reasonable, as the additional delay is likely caused by GIL contention.
  4. The DP model's outer forward pass takes around 17-18 ms, which suggests the replication + scattering + gathering adds another 2-5 ms.
  5. If this additional overhead from DP is not acceptable, you might want to try DistributedDataParallel.
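For reference, the debug code in the PR follows roughly this pattern (a minimal sketch; the placeholder model/data below just stand in for the actual model and batch, and the Inner Forward latency wraps the body of model.forward() the same way):

import torch
import torch.nn as nn

# Placeholder model and input, only to make the snippet self-contained.
model = nn.Linear(128, 128).cuda()
data = torch.randn(32, 128, device='cuda')

# "Outer" forward latency: wrap the model(data) call with CUDA events.
# For DataParallel this span also includes replicate/scatter/gather.
e_start = torch.cuda.Event(enable_timing=True)
e_finish = torch.cuda.Event(enable_timing=True)

e_start.record()
output = model(data)
e_finish.record()

# Wait for the finish event before reading it (see the synchronize()
# discussion further down in this thread).
e_finish.synchronize()
print('Forward latency is: {} '.format(e_start.elapsed_time(e_finish)))  # milliseconds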
device_count() 2                    
get_device_name Quadro GP100                          
get_device_name Quadro GP100                          
===== Warn Up =====                   
Inner Forward latency at cuda:0 is: 1505.4864501953125
Forward latency is: 1505.63232421875                  
[W TensorIterator.cpp:917] Warning: Mixed memory format inputs detected while calling the operator. The operator will output contiguous tensor even if some of the inputs are in channels_last format. (function operator())
[W TensorIterator.cpp:923] Warning: Mixed memory format inputs detected while calling the operator. The operator will output channels_last tensor even if some of the inputs are not in channels_last format. (function operator())
Inner Forward latency at cuda:0 is: 55.495487213134766
Forward latency is: 55.64147186279297                 
Inner Forward latency at cuda:0 is: 35.402400970458984
Forward latency is: 35.53887939453125                 
Inner Forward latency at cuda:0 is: 9.02233600616455  
Forward latency is: 9.138496398925781 
Inner Forward latency at cuda:0 is: 25.25040054321289
Forward latency is: 25.4737606048584           
Inner Forward latency at cuda:0 is: 11.507648468017578
Forward latency is: 11.665023803710938
Inner Forward latency at cuda:0 is: 10.732000350952148
Forward latency is: 10.859199523925781
Train Epoch: 1  Loss: 8.790261

===== Single GPU =====
Inner Forward latency at cuda:0 is: 11.000991821289062 
Forward latency is: 11.108160018920898 
Inner Forward latency at cuda:0 is: 11.089119911193848 
Forward latency is: 11.22646427154541 
Inner Forward latency at cuda:0 is: 11.005184173583984 
Forward latency is: 11.144991874694824 
Inner Forward latency at cuda:0 is: 11.082048416137695 
Forward latency is: 11.23363208770752 
Inner Forward latency at cuda:0 is: 12.89516830444336 
Forward latency is: 13.033023834228516 
Inner Forward latency at cuda:0 is: 11.095647811889648 
Forward latency is: 11.230015754699707 
Inner Forward latency at cuda:0 is: 10.849760055541992 
Forward latency is: 10.985152244567871 
Train Epoch: 1  Loss: 8.162259
===== DataParallel =====
Inner Forward latency at cuda:0 is: 32.145503997802734 
Inner Forward latency at cuda:1 is: 1494.096435546875 
Forward latency is: 3853.993896484375 
Inner Forward latency at cuda:0 is: 13.01587200164795 
Inner Forward latency at cuda:1 is: 15.880448341369629 
Forward latency is: 18.9583683013916 
Inner Forward latency at cuda:1 is: 13.992992401123047 
Inner Forward latency at cuda:0 is: 14.519583702087402 
Forward latency is: 17.375328063964844 
Inner Forward latency at cuda:0 is: 14.469663619995117 
Inner Forward latency at cuda:1 is: 14.402463912963867 
Forward latency is: 17.670656204223633 
Inner Forward latency at cuda:0 is: 13.701855659484863 
Inner Forward latency at cuda:1 is: 14.609951972961426 
Forward latency is: 17.782848358154297 
Inner Forward latency at cuda:0 is: 14.08131217956543 
Inner Forward latency at cuda:1 is: 14.001919746398926 
Forward latency is: 17.2807674407959 
Inner Forward latency at cuda:1 is: 13.710176467895508 
Inner Forward latency at cuda:0 is: 14.413311958312988 
Forward latency is: 17.411136627197266 
Train Epoch: 1  Loss: 7.542133

Oh, one more thing: this warning might be a sign of a real perf penalty.

cc memory format expert @VitalyFedyunin
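I can't tell from the logs which tensors end up in channels_last here, but if that is intentional, one way to avoid the mixed-format path is to keep the model parameters and the batch in the same memory format. A minimal sketch, assuming 4D conv-style inputs (placeholder model/data, so this may or may not map onto the actual model):

import torch
import torch.nn as nn

# Placeholder conv layer and NCHW batch, just to illustrate the idea.
model = nn.Conv2d(3, 16, kernel_size=3, padding=1).cuda()
data = torch.randn(8, 3, 64, 64, device='cuda')

# Put both parameters and inputs into channels_last (or keep both contiguous);
# mixing the two formats is what triggers the TensorIterator warning.
model = model.to(memory_format=torch.channels_last)
data = data.to(memory_format=torch.channels_last)

output = model(data)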

Thank you very much @mrshenli for the thorough explanation. When running your modified code, I get the following error:

Traceback (most recent call last):
  File "main.py", line 139, in <module>
    main()
  File "main.py", line 124, in main
    model.train()
  File "/afs/blade/Project_B/VAEtrain.py", line 137, in train
    self.trainepoch(epoch)
  File "/afs/blade/Project_B/VAEtrain.py", line 116, in trainepoch
    print('Forward latency is: {} '.format(e_start.elapsed_time(e_finish)))
  File "/afs/blade/.conda/envs/My_env/lib/python3.6/site-packages/torch/cuda/streams.py", line 166, in elapsed_time
    return super(Event, self).elapsed_time(end_event)
RuntimeError: CUDA error: device not ready

which I fixed by adding torch.cuda.synchronize() before each print statement (last two commits on GitHub). I'm not sure why it worked when you ran it but not when I do. With 600 training samples and a batch size of 32, I get the following results:

device_count() 2
get_device_name Quadro RTX 6000
get_device_name Quadro RTX 6000
===== Warn Up =====
Inner Forward latency at cuda:0 is: 312.0291748046875 
Forward latency is: 312.20123291015625 
Inner Forward latency at cuda:0 is: 7.032832145690918 
Forward latency is: 7.154176235198975 
Inner Forward latency at cuda:0 is: 6.665823936462402 
Forward latency is: 6.760447978973389 
Inner Forward latency at cuda:0 is: 6.610943794250488 
Forward latency is: 6.698976039886475 
Inner Forward latency at cuda:0 is: 6.579264163970947 
Forward latency is: 6.668288230895996 
Inner Forward latency at cuda:0 is: 6.5924482345581055 
Forward latency is: 6.685696125030518 
Inner Forward latency at cuda:0 is: 6.5490241050720215 
Forward latency is: 6.63756799697876 
Inner Forward latency at cuda:0 is: 6.6170878410339355 
Forward latency is: 6.7051520347595215 
Inner Forward latency at cuda:0 is: 6.561567783355713 
Forward latency is: 6.649856090545654 
Inner Forward latency at cuda:0 is: 6.608831882476807 
Forward latency is: 6.696991920471191 
Inner Forward latency at cuda:0 is: 6.565536022186279 
Forward latency is: 6.652768135070801 
Inner Forward latency at cuda:0 is: 7.073791980743408 
Forward latency is: 7.168000221252441 
Inner Forward latency at cuda:0 is: 6.677855968475342 
Forward latency is: 6.764575958251953 
Inner Forward latency at cuda:0 is: 6.579904079437256 
Forward latency is: 6.668064117431641 
Inner Forward latency at cuda:0 is: 6.5595197677612305 
Forward latency is: 6.647808074951172 
Inner Forward latency at cuda:0 is: 7.110752105712891 
Forward latency is: 7.204095840454102 
Inner Forward latency at cuda:0 is: 6.617055892944336 
Forward latency is: 6.7060160636901855 
Inner Forward latency at cuda:0 is: 6.559423923492432 
Forward latency is: 6.6473917961120605 
Inner Forward latency at cuda:0 is: 6.67033576965332 
Forward latency is: 6.759903907775879 
Train Epoch: 1	Loss: 3.500732
===== Single GPU =====
Inner Forward latency at cuda:0 is: 11.719776153564453 
Forward latency is: 11.848064422607422 
Inner Forward latency at cuda:0 is: 6.635519981384277 
Forward latency is: 6.722879886627197 
Inner Forward latency at cuda:0 is: 6.606272220611572 
Forward latency is: 6.693664073944092 
Inner Forward latency at cuda:0 is: 6.573984146118164 
Forward latency is: 6.675968170166016 
Inner Forward latency at cuda:0 is: 6.516831874847412 
Forward latency is: 6.606847763061523 
Inner Forward latency at cuda:0 is: 6.608352184295654 
Forward latency is: 6.696959972381592 
Inner Forward latency at cuda:0 is: 6.593344211578369 
Forward latency is: 6.689760208129883 
Inner Forward latency at cuda:0 is: 38.63993453979492 
Forward latency is: 38.72832107543945 
Inner Forward latency at cuda:0 is: 6.6211838722229 
Forward latency is: 6.716095924377441 
Inner Forward latency at cuda:0 is: 6.551616191864014 
Forward latency is: 6.639872074127197 
Inner Forward latency at cuda:0 is: 7.57587194442749 
Forward latency is: 7.692992210388184 
Inner Forward latency at cuda:0 is: 6.579360008239746 
Forward latency is: 6.665279865264893 
Inner Forward latency at cuda:0 is: 58.9370231628418 
Forward latency is: 59.02336120605469 
Inner Forward latency at cuda:0 is: 6.718719959259033 
Forward latency is: 6.813632011413574 
Inner Forward latency at cuda:0 is: 6.606624126434326 
Forward latency is: 6.692448139190674 
Inner Forward latency at cuda:0 is: 6.61740779876709 
Forward latency is: 6.706719875335693 
Inner Forward latency at cuda:0 is: 6.545951843261719 
Forward latency is: 6.634687900543213 
Inner Forward latency at cuda:0 is: 6.5739521980285645 
Forward latency is: 6.660799980163574 
Inner Forward latency at cuda:0 is: 6.364831924438477 
Forward latency is: 6.4543681144714355 
Train Epoch: 1	Loss: 2.285649
===== DataParallel =====
Inner Forward latency at cuda:0 is: 42.26464080810547 
Inner Forward latency at cuda:1 is: 393.2672119140625 
Forward latency is: 4947.4130859375 
Inner Forward latency at cuda:0 is: 17.808063507080078 
Inner Forward latency at cuda:1 is: 22.410879135131836 
Forward latency is: 26.27788734436035 
Inner Forward latency at cuda:1 is: 20.338655471801758 
Inner Forward latency at cuda:0 is: 22.937599182128906 
Forward latency is: 26.417760848999023 
Inner Forward latency at cuda:1 is: 18.742080688476562 
Inner Forward latency at cuda:0 is: 21.928159713745117 
Forward latency is: 25.05731201171875 
Inner Forward latency at cuda:1 is: 13.354111671447754 
Inner Forward latency at cuda:0 is: 16.105663299560547 
Forward latency is: 19.50511932373047 
Inner Forward latency at cuda:0 is: 21.91974449157715 
Inner Forward latency at cuda:1 is: 23.952512741088867 
Forward latency is: 27.585535049438477 
Inner Forward latency at cuda:0 is: 22.37603187561035 
Inner Forward latency at cuda:1 is: 22.983200073242188 
Forward latency is: 26.94927978515625 
Inner Forward latency at cuda:1 is: 18.544095993041992 
Inner Forward latency at cuda:0 is: 20.25881576538086 
Forward latency is: 23.73529624938965 
Inner Forward latency at cuda:0 is: 20.680864334106445 
Inner Forward latency at cuda:1 is: 22.002399444580078 
Forward latency is: 26.0317440032959 
Inner Forward latency at cuda:0 is: 19.656959533691406 
Inner Forward latency at cuda:1 is: 21.027456283569336 
Forward latency is: 24.815616607666016 
Inner Forward latency at cuda:0 is: 22.988128662109375 
Inner Forward latency at cuda:1 is: 23.180959701538086 
Forward latency is: 27.598848342895508 
Inner Forward latency at cuda:1 is: 21.547008514404297 
Inner Forward latency at cuda:0 is: 22.051231384277344 
Forward latency is: 25.91289520263672 
Inner Forward latency at cuda:1 is: 19.94380760192871 
Inner Forward latency at cuda:0 is: 22.607072830200195 
Forward latency is: 26.406591415405273 
Inner Forward latency at cuda:0 is: 22.456287384033203 
Inner Forward latency at cuda:1 is: 23.465984344482422 
Forward latency is: 27.797727584838867 
Inner Forward latency at cuda:1 is: 21.703231811523438 
Inner Forward latency at cuda:0 is: 22.716032028198242 
Forward latency is: 27.013696670532227 
Inner Forward latency at cuda:0 is: 23.64579200744629 
Inner Forward latency at cuda:1 is: 24.85055923461914 
Forward latency is: 29.45417594909668 
Inner Forward latency at cuda:1 is: 24.470399856567383 
Inner Forward latency at cuda:0 is: 26.32054328918457 
Forward latency is: 30.444063186645508 
Inner Forward latency at cuda:0 is: 21.156511306762695 
Inner Forward latency at cuda:1 is: 22.762048721313477 
Forward latency is: 27.273216247558594 
Inner Forward latency at cuda:0 is: 25.427295684814453 
Inner Forward latency at cuda:1 is: 25.92563247680664 
Forward latency is: 30.507776260375977 
Train Epoch: 1	Loss: 1.831364

As you can see, when I run the non-DP model, the inner forward pass takes about 6-7 ms, while the DP model takes 21-23 ms. That seems like a much larger gap. Does this look alright? I'm not sure whether synchronizing before the print statement causes this, but I was not able to run the code without that change. May I ask how much training data and what batch size you are using, so that we can compare the same results?

Hey @blade

Good catch. I added e_finish.synchronize() before calling elapsed_time to fix the problem. That it worked with the prior code might be due to lucky timing, i.e., the kernels happened to be done by the time the event was consulted. I still see similar results (outputs pasted at the end of this post).
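For completeness, the two fixes differ only in how much they wait for; assuming the same e_start/e_finish events as in the timing sketch earlier in this thread, either of these makes elapsed_time safe to call:

# What the latest commits do (per the post above): wait for all work
# queued on the current device before reading the events.
torch.cuda.synchronize()
print('Forward latency is: {} '.format(e_start.elapsed_time(e_finish)))

# The narrower alternative: wait only for the finish event itself.
e_finish.synchronize()
print('Forward latency is: {} '.format(e_start.elapsed_time(e_finish)))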

As you can see, when I run the non-DP model, the inner forward pass takes about 6-7 ms, while the DP model takes 21-23 ms. That seems like a much larger gap. Does this look alright?

This does not look correct to me, but I cannot explain it. The DP threads shouldn't take 4X as long for every forward pass. BTW, what GPU type and CUDA version are you using?

cc CUDA expert @ngimel if this rings a bell.

May I ask how much training data and what batch size you are using, so that we can compare the same results?

I just ran the code as-is using python main.py. I copied the .data file to the data folder, as the program seems to expect to read it from there.

===== Warn Up =====                   
Inner Forward latency at cuda:0 is: 1488.1781005859375
Forward latency is: 1488.338134765625                
[W TensorIterator.cpp:917] Warning: Mixed memory format inputs detected while calling the operator. The operator will output contiguous tensor even if some of the inputs are in channels_last format. (function operator())
[W TensorIterator.cpp:923] Warning: Mixed memory format inputs detected while calling the operator. The operator will output channels_last tensor even if some of the inputs are not in channels_last format. (function operator())
Inner Forward latency at cuda:0 is: 58.0959358215332  
Forward latency is: 58.23100662231445                 
Inner Forward latency at cuda:0 is: 8.668992042541504 
Forward latency is: 8.790816307067871                 
Inner Forward latency at cuda:0 is: 8.797216415405273 
Forward latency is: 8.929183959960938 
Inner Forward latency at cuda:0 is: 18.98464012145996
Forward latency is: 19.132671356201172
Inner Forward latency at cuda:0 is: 34.01536178588867 
Forward latency is: 34.20316696166992                 
Inner Forward latency at cuda:0 is: 8.927007675170898
Forward latency is: 9.08579158782959                  
Train Epoch: 1  Loss: 8.790261                        
===== Single GPU =====                               
Inner Forward latency at cuda:0 is: 8.95967960357666  
Forward latency is: 9.068127632141113                 
Inner Forward latency at cuda:0 is: 8.704704284667969 
Forward latency is: 8.856863975524902                 
Inner Forward latency at cuda:0 is: 8.709856033325195 
Forward latency is: 8.831232070922852                 
Inner Forward latency at cuda:0 is: 8.72275161743164  
Forward latency is: 8.845024108886719                 
Inner Forward latency at cuda:0 is: 18.166719436645508
Forward latency is: 18.28998374938965                 
Inner Forward latency at cuda:0 is: 8.70742416381836  
Forward latency is: 8.828767776489258                 
Inner Forward latency at cuda:0 is: 8.335904121398926 
Forward latency is: 8.45967960357666                  
Train Epoch: 1  Loss: 8.162259   
===== DataParallel Warn Up =====
Inner Forward latency at cuda:0 is: 29.18454360961914
Inner Forward latency at cuda:1 is: 1496.7349853515625
Forward latency is: 3831.19189453125
Inner Forward latency at cuda:1 is: 15.559136390686035
Inner Forward latency at cuda:0 is: 15.93996810913086
Forward latency is: 19.217119216918945
Inner Forward latency at cuda:0 is: 13.919391632080078
Inner Forward latency at cuda:1 is: 14.139103889465332
Forward latency is: 17.232608795166016
Inner Forward latency at cuda:1 is: 14.457535743713379
Inner Forward latency at cuda:0 is: 14.94876766204834
Forward latency is: 17.987136840820312
Inner Forward latency at cuda:1 is: 13.798944473266602
Inner Forward latency at cuda:0 is: 14.654208183288574
Forward latency is: 17.719648361206055
Inner Forward latency at cuda:1 is: 13.641887664794922
Inner Forward latency at cuda:0 is: 14.081600189208984
Forward latency is: 17.102848052978516
Inner Forward latency at cuda:0 is: 13.795040130615234
Inner Forward latency at cuda:1 is: 13.760607719421387
Forward latency is: 18.731584548950195
Train Epoch: 1  Loss: 7.542133
===== DataParallel =====
Inner Forward latency at cuda:0 is: 13.752639770507812
Inner Forward latency at cuda:1 is: 14.990143775939941
Forward latency is: 18.172191619873047
Inner Forward latency at cuda:0 is: 13.295552253723145
Inner Forward latency at cuda:1 is: 13.322591781616211
Forward latency is: 16.823423385620117
Inner Forward latency at cuda:0 is: 13.860544204711914
Inner Forward latency at cuda:1 is: 14.084735870361328
Forward latency is: 17.313791275024414
Inner Forward latency at cuda:0 is: 13.091039657592773
Inner Forward latency at cuda:1 is: 12.956319808959961
Forward latency is: 16.153152465820312
Inner Forward latency at cuda:1 is: 13.988863945007324
Inner Forward latency at cuda:0 is: 14.459775924682617
Forward latency is: 17.44748878479004
Inner Forward latency at cuda:0 is: 13.118144035339355
Inner Forward latency at cuda:1 is: 13.629311561584473
Forward latency is: 16.748767852783203
Inner Forward latency at cuda:1 is: 13.413951873779297
Inner Forward latency at cuda:0 is: 14.655072212219238
Forward latency is: 17.669599533081055
Train Epoch: 1  Loss: 6.871040

Thanks @mrshenli

I'm using RTX 6000 GPU cards with 24GB of memory. The CUDA version is 10.0. I'm rerunning with e_finish.synchronize() and the default batch/data size to be consistent with you:

device_count() 2
get_device_name Quadro RTX 6000
get_device_name Quadro RTX 6000
===== Warn Up =====
Inner Forward latency at cuda:0 is: 269.352783203125 
Forward latency is: 269.5054931640625 
Inner Forward latency at cuda:0 is: 6.518432140350342 
Forward latency is: 6.623968124389648 
Inner Forward latency at cuda:0 is: 6.06825590133667 
Forward latency is: 6.1439361572265625 
Inner Forward latency at cuda:0 is: 6.148096084594727 
Forward latency is: 6.224703788757324 
Inner Forward latency at cuda:0 is: 6.141952037811279 
Forward latency is: 6.215456008911133 
Inner Forward latency at cuda:0 is: 6.092895984649658 
Forward latency is: 6.170144081115723 
Inner Forward latency at cuda:0 is: 5.781504154205322 
Forward latency is: 5.854623794555664 
Train Epoch: 1	Loss: 8.880923
===== Single GPU =====
Inner Forward latency at cuda:0 is: 6.268703937530518 
Forward latency is: 6.322175979614258 
Inner Forward latency at cuda:0 is: 6.090367794036865 
Forward latency is: 6.166687965393066 
Inner Forward latency at cuda:0 is: 6.139904022216797 
Forward latency is: 6.215072154998779 
Inner Forward latency at cuda:0 is: 6.092895984649658 
Forward latency is: 6.168735980987549 
Inner Forward latency at cuda:0 is: 6.106112003326416 
Forward latency is: 6.18614387512207 
Inner Forward latency at cuda:0 is: 6.128384113311768 
Forward latency is: 6.200736045837402 
Inner Forward latency at cuda:0 is: 5.888192176818848 
Forward latency is: 5.9773759841918945 
Train Epoch: 1	Loss: 7.826999
===== DataParallel =====
Inner Forward latency at cuda:0 is: 29.677568435668945 
Inner Forward latency at cuda:1 is: 247.988037109375 
Forward latency is: 5157.865234375 
Inner Forward latency at cuda:0 is: 20.65011215209961 
Inner Forward latency at cuda:1 is: 22.26380729675293 
Forward latency is: 26.309247970581055 
Inner Forward latency at cuda:0 is: 18.71731185913086 
Inner Forward latency at cuda:1 is: 20.099807739257812 
Forward latency is: 25.069536209106445 
Inner Forward latency at cuda:1 is: 18.127199172973633 
Inner Forward latency at cuda:0 is: 21.228160858154297 
Forward latency is: 27.178943634033203 
Inner Forward latency at cuda:0 is: 22.07948875427246 
Inner Forward latency at cuda:1 is: 21.940223693847656 
Forward latency is: 28.237247467041016 
Inner Forward latency at cuda:0 is: 22.045312881469727 
Inner Forward latency at cuda:1 is: 22.9520320892334 
Forward latency is: 28.17840003967285 
Inner Forward latency at cuda:1 is: 19.71459197998047 
Inner Forward latency at cuda:0 is: 20.302751541137695 
Forward latency is: 26.347999572753906 
Train Epoch: 1	Loss: 6.633629

I also have access to V100 GPU cards with 32GB of memory. The output with those is the following:

device_count() 2
get_device_name Tesla V100-PCIE-32GB
get_device_name Tesla V100-PCIE-32GB
===== Warn Up =====
Inner Forward latency at cuda:0 is: 330.8124084472656 
Forward latency is: 330.9803771972656 
Inner Forward latency at cuda:0 is: 6.412288188934326 
Forward latency is: 6.533120155334473 
Inner Forward latency at cuda:0 is: 6.1184000968933105 
Forward latency is: 6.200319766998291 
Inner Forward latency at cuda:0 is: 6.159359931945801 
Forward latency is: 6.237184047698975 
Inner Forward latency at cuda:0 is: 18.338815689086914 
Forward latency is: 18.41868782043457 
Inner Forward latency at cuda:0 is: 6.0641279220581055 
Forward latency is: 6.145023822784424 
Inner Forward latency at cuda:0 is: 5.8081278800964355 
Forward latency is: 5.885951995849609 
Train Epoch: 1	Loss: 8.888597
===== Single GPU =====
Inner Forward latency at cuda:0 is: 6.325247764587402 
Forward latency is: 6.380544185638428 
Inner Forward latency at cuda:0 is: 6.120448112487793 
Forward latency is: 6.201344013214111 
Inner Forward latency at cuda:0 is: 6.12659215927124 
Forward latency is: 6.204415798187256 
Inner Forward latency at cuda:0 is: 6.134751796722412 
Forward latency is: 6.215680122375488 
Inner Forward latency at cuda:0 is: 6.090752124786377 
Forward latency is: 6.168575763702393 
Inner Forward latency at cuda:0 is: 6.11737585067749 
Forward latency is: 6.193151950836182 
Inner Forward latency at cuda:0 is: 66.82112121582031 
Forward latency is: 66.90201568603516 
Train Epoch: 1	Loss: 7.709968
===== DataParallel =====
Inner Forward latency at cuda:1 is: 268.62591552734375 
Inner Forward latency at cuda:0 is: 275.9311218261719 
Forward latency is: 5106.107421875 
Inner Forward latency at cuda:0 is: 18.130943298339844 
Inner Forward latency at cuda:1 is: 20.929536819458008 
Forward latency is: 26.099712371826172 
Inner Forward latency at cuda:1 is: 19.733503341674805 
Inner Forward latency at cuda:0 is: 20.28544044494629 
Forward latency is: 25.09929656982422 
Inner Forward latency at cuda:1 is: 20.525056838989258 
Inner Forward latency at cuda:0 is: 22.49932861328125 
Forward latency is: 26.862592697143555 
Inner Forward latency at cuda:0 is: 21.86342430114746 
Inner Forward latency at cuda:1 is: 22.064159393310547 
Forward latency is: 26.846208572387695 
Inner Forward latency at cuda:0 is: 19.932159423828125 
Inner Forward latency at cuda:1 is: 20.315135955810547 
Forward latency is: 25.975807189941406 
Inner Forward latency at cuda:0 is: 19.63417625427246 
Inner Forward latency at cuda:1 is: 20.179967880249023 
Forward latency is: 25.976831436157227 
Train Epoch: 1	Loss: 6.956287

Hey @blade

I talked with @ngimel offline, but nothing rings a bell. Sorry, since we cannot reproduce this locally, we cannot help debug further. If the slowdown is a concern, it might be worth trying DistributedDataParallel.
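If you do try DistributedDataParallel, the usual single-node pattern looks roughly like the sketch below (ToyModel is a placeholder for your actual model; with one process per GPU there is no GIL contention and no per-iteration replicate/scatter/gather):

import os
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

class ToyModel(nn.Module):
    """Stand-in for the real model; only here to make the sketch runnable."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(10, 10)

    def forward(self, x):
        return self.net(x)

def run(rank, world_size):
    # One process per GPU, each owning its own model replica.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    torch.cuda.set_device(rank)
    model = DDP(ToyModel().to(rank), device_ids=[rank])

    # Normally you would build a DataLoader with a DistributedSampler here;
    # a single dummy step keeps the sketch short.
    out = model(torch.randn(32, 10, device=rank))
    out.sum().backward()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size)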
