I’m using 2 GPUs, but GPU utilization on both of them fluctuates during NLP training.
I’ve been monitoring this with nvidia-smi.
If there is a way to keep GPU utilization high, I expect the training time could be reduced.
I suspect that the large dataset loaded from a joblib pkl file might be a factor, so I’ve tried a few things.
I also noticed that when I slice the input data, the per-batch training speed increases. Is that expected?
I’ve already tried the following:
- num_workers set to values from 0 to 14 (a rough sketch of the loader setup follows this list)
- pin_memory = True
- keeping all data in memory (loaded in the dataset’s __init__)
- checking CPU usage, which wasn’t particularly high
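For context, here is a minimal sketch of roughly how the dataset and loader are set up (the dataset class, file path, and batch size are illustrative, not my exact code):

```python
import joblib
import torch
from torch.utils.data import Dataset, DataLoader

class PickledTextDataset(Dataset):
    """Illustrative map-style dataset: the whole joblib pickle is loaded into memory once."""
    def __init__(self, path):
        self.samples = joblib.load(path)  # assumed to be a list of (input_ids, label) pairs

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        input_ids, label = self.samples[idx]
        return torch.as_tensor(input_ids), torch.as_tensor(label)

loader = DataLoader(
    PickledTextDataset("data.pkl"),  # hypothetical path
    batch_size=64,                   # illustrative
    shuffle=True,
    num_workers=8,                   # tried values between 0 and 14
    pin_memory=True,
)
```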
Is there anything else I might be missing that could make things better?
Please help this beginner.
Thanks.
Generally, I would recommend profiling your use case before applying optimizations, so you know where the actual bottleneck is. The performance guide and this post might also be helpful.
Thank you for the quick answer.
I’ll go through the links; they should be helpful.
Actually, I’ve already profiled this model with profiler.profile, but I couldn’t fully understand the result.
The profiler doesn’t show anything about the DataLoader, even though the DataLoader loop is inside the with profiler.profile block. Did I do it right?
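To be concrete, the loop is wrapped roughly like this (a simplified sketch; the loader, model, and optimizer names are placeholders, not my exact script):

```python
import torch
from torch.autograd import profiler

# Simplified sketch of how the training loop is wrapped (assumed, not the exact script).
with profiler.profile(use_cuda=True, profile_memory=True) as prof:
    for input_ids, labels in loader:                 # DataLoader iteration is inside the profiled region
        input_ids = input_ids.cuda(non_blocking=True)
        labels = labels.cuda(non_blocking=True)
        loss = model(input_ids, labels)              # model returns the loss here (illustrative)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```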
In the result, cudaFree accounts for 20% of the time. Why does it show up under CPU time? Is it related to the GPU utilization?
Most of the CUDA time is spent in aten::addmm, but I think optimizing that wouldn’t fix the GPU utilization fluctuation, since that work already runs on the GPU.
This may be a basic question since I’m not familiar with PyTorch yet, but I’d like to get to the bottom of this issue.
| Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | Self CUDA | Self CUDA % | CUDA total | CUDA time avg | CPU Mem | Self CPU Mem | CUDA Mem | Self CUDA Mem | # of Calls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DistributedDataParallel.forward | 1.58% | 82.714ms | 79.08% | 4.153s | 2.076s | 14.957ms | 0.29% | 4.239s | 2.120s | 184 b | -26.79 Mb | 2.86 Gb | -2.52 Gb | 2 |
| aten::linear | 0.60% | 31.512ms | 21.15% | 1.111s | 7.506ms | 2.990ms | 0.06% | 1.270s | 8.580ms | 0 b | 0 b | 879.35 Mb | 0 b | 148 |
| aten::addmm | 0.20% | 10.647ms | 20.40% | 1.071s | 7.238ms | 1.262s | 24.35% | 1.263s | 8.535ms | 0 b | 0 b | 879.35 Mb | 731.35 Mb | 148 |
| cudaFree | 20.00% | 1.050s | 20.00% | 1.050s | 350.069ms | 0.000us | 0.00% | 0.000us | 0.000us | 0 b | 0 b | 0 b | 0 b | 3 |
| aten::div | 8.93% | 468.917ms | 9.13% | 479.685ms | 1.057ms | 483.799ms | 9.33% | 483.984ms | 1.066ms | 1.95 Mb | 1.95 Mb | 1.94 Gb | 1.94 Gb | 454 |
| aten::add | 7.38% | 387.274ms | 7.40% | 388.333ms | 1.961ms | 405.858ms | 7.83% | 405.858ms | 2.050ms | 3.89 Mb | 3.89 Mb | 1.65 Gb | 1.65 Gb | 198 |
| aten::abs | 3.54% | 185.689ms | 7.07% | 371.299ms | 92.825ms | 185.689ms | 3.58% | 371.333ms | 92.833ms | 3.89 Mb | 1.95 Mb | 0 b | 0 b | 4 |
| aten::mul | 5.72% | 300.190ms | 6.98% | 366.603ms | 844.707us | 313.915ms | 6.06% | 314.000ms | 723.502us | 2.92 Mb | 2.92 Mb | 575.64 Mb | 575.64 Mb | 434 |
| autograd::engine::evaluate_function: torch::autograd... | 1.09% | 57.130ms | 6.61% | 347.215ms | 872.399us | 11.044ms | 0.21% | 140.752ms | 353.648us | 0 b | 0 b | -421.77 Mb | -421.77 Mb | 398 |
| aten::lt | 5.33% | 279.649ms | 5.33% | 279.649ms | 69.912ms | 279.704ms | 5.40% | 279.704ms | 69.926ms | 498.04 Kb | 498.04 Kb | 0 b | 0 b | 4 |
Self CPU time total: 5.251s
Self CUDA time total: 5.183s
I would recommend profiling the workload with a visual profiler that creates a proper timeline view showing the kernel launches and their execution times. Nsight Systems is the one I always use, but the native PyTorch profiler can also create an output viewable in the browser.
This will let you actually see how long each phase takes. An example can be found here.
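If you go the native route, a minimal sketch of exporting a browser-viewable trace could look like this (the loader, model, and optimizer names are placeholders, and the step count is arbitrary):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Minimal sketch: capture a few training steps and export a timeline trace.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for step, (input_ids, labels) in enumerate(loader):
        loss = model(input_ids.cuda(non_blocking=True), labels.cuda(non_blocking=True))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step >= 5:  # a handful of steps is usually enough for a timeline
            break

prof.export_chrome_trace("trace.json")  # open in chrome://tracing or Perfetto to inspect the timeline
```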