Omroth (Ian), October 29, 2022, 7:13am (#1)
Dear experts,
I have a 3D convolutional PyTorch network which is learning, but slowly. I am currently training on a Google Cloud instance of type "n1-standard-4" with one 16GB NVIDIA T4 and 4 virtual CPUs.
I could upgrade this, up to 4 T4s (and could also increase the number of virtual CPUs). Would that alone be likely to speed up training? If not, what would I need to do to take advantage of more GPUs: turn on some kind of parallel training in PyTorch?
Thank you.
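For reference, extra GPUs don't help by themselves: the usual route in PyTorch is one process per GPU, with the model wrapped in DistributedDataParallel. A minimal single-node sketch, using a toy Conv3d as a stand-in for the real network and assuming a launch via torchrun:

```python
# Minimal single-node DDP sketch; the Conv3d layer is a toy stand-in
# for the real network. Launch with:
#   torchrun --nproc_per_node=4 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")      # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Conv3d(1, 8, kernel_size=3).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # syncs gradients across GPUs

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(2, 1, 16, 16, 16, device=f"cuda:{local_rank}")
    loss = model(x).sum()
    loss.backward()                              # gradient all-reduce happens here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Each process then trains on its own shard of the data (typically via a DistributedSampler), and gradients are averaged across GPUs automatically during backward.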
Omroth (Ian), October 29, 2022, 9:48am (#2)
Using the profiler, it appears the Adam optimiser step is taking a lot of CPU time; is that expected?
Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | Self CUDA | Self CUDA % | CUDA total | CUDA time avg | # of Calls
Optimizer.step#Adam.step | 8.96% | 257.987ms | 23.29% | 670.413ms | 670.413ms | 582.000us | 8.05% | 1.875ms | 1.875ms | 1
Optimizer.step#Adam.step | 6.45% | 185.742ms | 23.15% | 666.236ms | 666.236ms | 1.472ms | 20.35% | 4.727ms | 4.727ms | 1
Optimizer.step#Adam.step | 8.97% | 258.049ms | 23.12% | 665.475ms | 665.475ms | 580.000us | 8.02% | 1.880ms | 1.880ms | 1
Optimizer.step#Adam.step | 8.95% | 257.567ms | 23.07% | 663.865ms | 663.865ms | 580.000us | 8.02% | 1.877ms | 1.877ms | 1
Optimizer.step#Adam.step | 7.84% | 225.699ms | 23.03% | 662.666ms | 662.666ms | 585.000us | 8.09% | 1.879ms | 1.879ms | 1
Optimizer.step#Adam.step | 8.99% | 258.702ms | 22.95% | 660.573ms | 660.573ms | 582.000us | 8.05% | 1.875ms | 1.875ms | 1
Optimizer.step#Adam.step | 9.03% | 259.876ms | 22.95% | 660.377ms | 660.377ms | 575.000us | 7.95% | 1.896ms | 1.896ms | 1
Optimizer.step#Adam.step | 8.99% | 258.798ms | 22.93% | 659.775ms | 659.775ms | 579.000us | 8.00% | 1.876ms | 1.876ms | 1
Optimizer.step#Adam.step | 8.91% | 256.324ms | 22.91% | 659.309ms | 659.309ms | 584.000us | 8.07% | 1.874ms | 1.874ms | 1
Optimizer.step#Adam.step | 8.92% | 256.716ms | 22.88% | 658.587ms | 658.587ms | 579.000us | 8.00% | 1.870ms | 1.870ms | 1
aten::to | 0.00% | 9.000us | 13.98% | 402.472ms | 402.472ms | 6.000us | 0.08% | 524.000us | 524.000us | 1
aten::_to_copy | 0.00% | 43.000us | 13.98% | 402.463ms | 402.463ms | 21.000us | 0.29% | 518.000us | 518.000us | 1
aten::copy_ | 0.00% | 19.000us | 13.98% | 402.417ms | 402.417ms | 496.000us | 6.86% | 496.000us | 496.000us | 1
cudaMemcpyAsync | 13.98% | 402.387ms | 13.98% | 402.387ms | 402.387ms | 0.000us | 0.00% | 0.000us | 0.000us | 1
aten::item | 0.00% | 30.000us | 11.66% | 335.706ms | 335.706ms | 12.000us | 0.17% | 78.000us | 78.000us | 1
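For context, a table like this comes from torch.profiler; a minimal sketch of the measurement, with a toy Conv3d standing in for the real network:

```python
import torch
from torch import nn
from torch.profiler import profile, ProfilerActivity

# Toy stand-ins so the sketch runs on its own; the real code would use
# the actual 3D conv network and data loader.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Conv3d(1, 8, kernel_size=3).to(device)
batch = torch.randn(2, 1, 16, 16, 16, device=device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    loss = model(batch).sum()
    loss.backward()
    optimizer.step()

# Sort by total CPU time, matching the columns in the table above.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=15))
```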
You could use the fused Adam implementation in newer PyTorch versions by passing fused=True when constructing the optimizer.
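A minimal sketch of that change (the Conv3d model here is just a placeholder; fused=True requires PyTorch 1.13+ and parameters on CUDA):

```python
import torch

model = torch.nn.Conv3d(1, 8, kernel_size=3).cuda()  # placeholder model

# fused=True performs the Adam update in fused multi-tensor CUDA kernels
# instead of a Python-side loop over parameters, cutting CPU overhead.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, fused=True)
```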
Omroth (Ian), October 30, 2022, 9:03am (#4)
That helped a lot, thanks.
Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | Self CUDA | Self CUDA % | CUDA total | CUDA time avg | # of Calls
Optimizer.step#Adam.step | 8.81% | 221.228ms | 29.93% | 752.074ms | 752.074ms | 203.000us | 1.15% | 3.352ms | 3.352ms | 1
aten::copy_ | 0.00% | 18.000us | 24.06% | 604.390ms | 604.390ms | 44.000us | 0.25% | 44.000us | 44.000us | 1
cudaMemcpyAsync | 24.05% | 604.366ms | 24.05% | 604.366ms | 604.366ms | 0.000us | 0.00% | 0.000us | 0.000us | 1
aten::to | 0.00% | 10.000us | 23.54% | 591.447ms | 591.447ms | 5.000us | 0.03% | 5.308ms | 5.308ms | 1
aten::_to_copy | 0.00% | 37.000us | 23.54% | 591.437ms | 591.437ms | 14.000us | 0.08% | 5.303ms | 5.303ms | 1
aten::copy_ | 0.00% | 24.000us | 23.54% | 591.386ms | 591.386ms | 5.288ms | 29.96% | 5.288ms | 5.288ms | 1
cudaMemcpyAsync | 23.54% | 591.327ms | 23.54% | 591.327ms | 591.327ms | 0.000us | 0.00% | 0.000us | 0.000us | 1
aten::to | 0.00% | 10.000us | 22.27% | 559.481ms | 559.481ms | 5.000us | 0.03% | 6.444ms | 6.444ms | 1
aten::_to_copy | 0.00% | 30.000us | 22.27% | 559.471ms | 559.471ms | 11.000us | 0.06% | 6.439ms | 6.439ms | 1
aten::copy_ | 0.00% | 21.000us | 22.27% | 559.430ms | 559.430ms | 6.427ms | 36.41% | 6.427ms | 6.427ms | 1
cudaMemcpyAsync | 22.26% | 559.371ms | 22.26% | 559.371ms | 559.371ms | 0.000us | 0.00% | 0.000us | 0.000us | 1
aten::to | 0.00% | 8.000us | 21.34% | 536.074ms | 536.074ms | 5.000us | 0.03% | 5.654ms | 5.654ms | 1
aten::_to_copy | 0.00% | 28.000us | 21.34% | 536.066ms | 536.066ms | 10.000us | 0.06% | 5.649ms | 5.649ms | 1
aten::copy_ | 0.00% | 18.000us | 21.33% | 536.027ms | 536.027ms | 5.638ms | 31.94% | 5.638ms | 5.638ms | 1
cudaMemcpyAsync | 21.33% | 535.971ms | 21.33% | 535.971ms | 535.971ms | 0.000us | 0.00% | 0.000us | 0.000us | 1
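The remaining hot spots are now host-to-device copies (the aten::to / aten::copy_ / cudaMemcpyAsync rows). One common way to cheapen those, assuming the batches come from a DataLoader (not something confirmed in the thread), is pinned host memory plus non_blocking transfers; a minimal sketch with a toy dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the real 3D volumes.
ds = TensorDataset(torch.randn(64, 1, 16, 16, 16), torch.randint(0, 2, (64,)))

# pin_memory=True stages batches in page-locked RAM, so the HtoD copy
# below can run asynchronously instead of blocking the CPU.
loader = DataLoader(ds, batch_size=8, pin_memory=True, num_workers=2)

for x, y in loader:
    x = x.cuda(non_blocking=True)  # overlaps the copy with other work
    y = y.cuda(non_blocking=True)
    # ... forward / backward / optimizer.step() ...
```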