Speeding up training with a gcloud instance

Dear experts,

I have a 3D convolutional PyTorch network which is learning, but slowly. I am currently training on a gcloud instance of type “n1-standard-4” with one NVIDIA T4 (16 GB) and 4 virtual CPUs.

I could upgrade this instance to have up to 4 T4s (and could also increase the number of virtual CPUs). Would that be likely to speed up training immediately? If not, what would I need to do to take advantage of more GPUs, such as turning on some kind of parallel training in PyTorch? For example, would something along the lines of the sketch below be the right direction?
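(Just a rough, untested sketch; the network and tensor shapes are placeholders for my real model.)

import torch
import torch.nn as nn

# Tiny stand-in for the real 3D conv network; layer sizes are placeholders.
class Tiny3DNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv3d(1, 8, kernel_size=3, padding=1)
        self.head = nn.Linear(8, 2)

    def forward(self, x):
        x = torch.relu(self.conv(x))
        x = x.mean(dim=(2, 3, 4))  # global average pool over D, H, W
        return self.head(x)

model = Tiny3DNet()
if torch.cuda.device_count() > 1:
    # DataParallel replicates the model on every visible GPU and splits each
    # batch along dim 0. DistributedDataParallel is the recommended, faster
    # alternative, but it needs one process per GPU (e.g. launched via torchrun).
    model = nn.DataParallel(model)
model = model.cuda()

x = torch.randn(8, 1, 32, 32, 32, device="cuda")  # a batch of 3D volumes
out = model(x)
print(out.shape)  # torch.Size([8, 2])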

Thank you.

Using the profiler, it appears the Adam optimiser step is taking a lot of CPU time. Is that expected?
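For context, this is roughly the kind of run that produces a table like the one below (the network and data here are simplified placeholders, and the exact sort/grouping options may differ from what I used):

import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Minimal stand-in network and data so the profiled loop runs end to end.
model = nn.Sequential(
    nn.Conv3d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(8, 2),
).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(4, 1, 16, 16, 16, device="cuda")
y = torch.randint(0, 2, (4,), device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

# Sort by total CPU time to see which ops dominate on the host side.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=15))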


                    Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  

Optimizer.step#Adam.step         8.96%     257.987ms        23.29%     670.413ms     670.413ms     582.000us         8.05%       1.875ms       1.875ms             1  
Optimizer.step#Adam.step         6.45%     185.742ms        23.15%     666.236ms     666.236ms       1.472ms        20.35%       4.727ms       4.727ms             1  
Optimizer.step#Adam.step         8.97%     258.049ms        23.12%     665.475ms     665.475ms     580.000us         8.02%       1.880ms       1.880ms             1  
Optimizer.step#Adam.step         8.95%     257.567ms        23.07%     663.865ms     663.865ms     580.000us         8.02%       1.877ms       1.877ms             1  
Optimizer.step#Adam.step         7.84%     225.699ms        23.03%     662.666ms     662.666ms     585.000us         8.09%       1.879ms       1.879ms             1  
Optimizer.step#Adam.step         8.99%     258.702ms        22.95%     660.573ms     660.573ms     582.000us         8.05%       1.875ms       1.875ms             1  
Optimizer.step#Adam.step         9.03%     259.876ms        22.95%     660.377ms     660.377ms     575.000us         7.95%       1.896ms       1.896ms             1  
Optimizer.step#Adam.step         8.99%     258.798ms        22.93%     659.775ms     659.775ms     579.000us         8.00%       1.876ms       1.876ms             1  
Optimizer.step#Adam.step         8.91%     256.324ms        22.91%     659.309ms     659.309ms     584.000us         8.07%       1.874ms       1.874ms             1  
Optimizer.step#Adam.step         8.92%     256.716ms        22.88%     658.587ms     658.587ms     579.000us         8.00%       1.870ms       1.870ms             1  
                aten::to         0.00%       9.000us        13.98%     402.472ms     402.472ms       6.000us         0.08%     524.000us     524.000us             1  
          aten::_to_copy         0.00%      43.000us        13.98%     402.463ms     402.463ms      21.000us         0.29%     518.000us     518.000us             1  
             aten::copy_         0.00%      19.000us        13.98%     402.417ms     402.417ms     496.000us         6.86%     496.000us     496.000us             1  
         cudaMemcpyAsync        13.98%     402.387ms        13.98%     402.387ms     402.387ms       0.000us         0.00%       0.000us       0.000us             1  
              aten::item         0.00%      30.000us        11.66%     335.706ms     335.706ms      12.000us         0.17%      78.000us      78.000us             1  

You could use the fused Adam implementation available in newer PyTorch versions by passing fused=True when constructing the optimizer.
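Something along these lines (the model here is just a placeholder for yours):

import torch
import torch.nn as nn

model = nn.Conv3d(1, 8, kernel_size=3).cuda()  # placeholder for the real network

# fused=True runs the Adam update in fused CUDA kernels instead of many small
# host-launched ops, which cuts the CPU overhead visible in the profile above.
# It requires the parameters to live on the GPU and a recent PyTorch release.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, fused=True)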


That helped a lot, thanks.


                    Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  

Optimizer.step#Adam.step         8.81%     221.228ms        29.93%     752.074ms     752.074ms     203.000us         1.15%       3.352ms       3.352ms             1  
             aten::copy_         0.00%      18.000us        24.06%     604.390ms     604.390ms      44.000us         0.25%      44.000us      44.000us             1  
         cudaMemcpyAsync        24.05%     604.366ms        24.05%     604.366ms     604.366ms       0.000us         0.00%       0.000us       0.000us             1  
                aten::to         0.00%      10.000us        23.54%     591.447ms     591.447ms       5.000us         0.03%       5.308ms       5.308ms             1  
          aten::_to_copy         0.00%      37.000us        23.54%     591.437ms     591.437ms      14.000us         0.08%       5.303ms       5.303ms             1  
             aten::copy_         0.00%      24.000us        23.54%     591.386ms     591.386ms       5.288ms        29.96%       5.288ms       5.288ms             1  
         cudaMemcpyAsync        23.54%     591.327ms        23.54%     591.327ms     591.327ms       0.000us         0.00%       0.000us       0.000us             1  
                aten::to         0.00%      10.000us        22.27%     559.481ms     559.481ms       5.000us         0.03%       6.444ms       6.444ms             1  
          aten::_to_copy         0.00%      30.000us        22.27%     559.471ms     559.471ms      11.000us         0.06%       6.439ms       6.439ms             1  
             aten::copy_         0.00%      21.000us        22.27%     559.430ms     559.430ms       6.427ms        36.41%       6.427ms       6.427ms             1  
         cudaMemcpyAsync        22.26%     559.371ms        22.26%     559.371ms     559.371ms       0.000us         0.00%       0.000us       0.000us             1  
                aten::to         0.00%       8.000us        21.34%     536.074ms     536.074ms       5.000us         0.03%       5.654ms       5.654ms             1  
          aten::_to_copy         0.00%      28.000us        21.34%     536.066ms     536.066ms      10.000us         0.06%       5.649ms       5.649ms             1  
             aten::copy_         0.00%      18.000us        21.33%     536.027ms     536.027ms       5.638ms        31.94%       5.638ms       5.638ms             1  
         cudaMemcpyAsync        21.33%     535.971ms        21.33%     535.971ms     535.971ms       0.000us         0.00%       0.000us       0.000us             1