GPU utilization fluctuates between 0% and 100%

I’m using 2 GPUs, but the utilization of both GPUs fluctuates during NLP training.
I’ve been monitoring this with nvidia-smi.
If there is a way to keep GPU utilization high, I expect the training time could be reduced.
I suspect that the large dataset loaded from a joblib pickle file might be a factor, so I’ve tried some solutions.
I also noticed that when I slice the input data, each training batch runs faster. Is that expected?

I’ve already tried the following (a sketch of the DataLoader setup is below the list):

  1. num_workers settings from 0 to 14
  2. pin_memory = True
  3. Loading all data into memory (in the dataset’s __init__)
  4. Checking CPU usage (it wasn’t that high)
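
A minimal sketch of the setup I’m describing (the class name, file path, and batch size here are placeholders, not my actual code):

```python
import joblib
import torch
from torch.utils.data import Dataset, DataLoader

class NlpDataset(Dataset):
    def __init__(self, pkl_path):
        # Item 3: load everything into host memory once.
        self.samples = joblib.load(pkl_path)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        x, y = self.samples[idx]
        return torch.as_tensor(x), torch.as_tensor(y)

loader = DataLoader(
    NlpDataset("data.pkl"),   # placeholder path
    batch_size=32,            # placeholder batch size
    shuffle=True,
    num_workers=8,            # item 1: tried values between 0 and 14
    pin_memory=True,          # item 2
)
```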

Is there anything else I could try to improve this?
Please help this beginner.
Thanks.

Generally, I would recommend profiling your use case before applying optimizations, so you know where the actual bottleneck is. The performance guide and this post might also be helpful.

Thank you for the quick answer.
I’ll look into the links; they should be helpful.
Actually, I’ve already profiled this model with profiler.profile, but I couldn’t fully understand the result.
The profiler didn’t show anything about the dataloader, even though the dataloader ran under the with profiler.profile block. Did I do it right?
In the result, cudaFree shows up with 20% of the CPU time. Why does it appear as CPU time? Is it related to the GPU utilization?
The addmm op is where most of the CUDA time is spent, but I think fixing that wouldn’t solve the GPU utilization fluctuation, since that work already runs on the GPU.
This may be a stupid question since I’m not familiar with PyTorch yet, but I’d like to get to the bottom of this issue.
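
For reference, the pattern I mean is roughly this (simplified sketch; model, loader, criterion, and optimizer are placeholders for my actual objects):

```python
import torch
from torch.autograd import profiler

with profiler.profile(use_cuda=True, profile_memory=True) as prof:
    for inputs, targets in loader:   # the dataloader sits inside the block
        inputs = inputs.cuda(non_blocking=True)
        targets = targets.cuda(non_blocking=True)
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Print the aggregated table, sorted by total CPU time.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```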

| Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | Self CUDA | Self CUDA % | CUDA total | CUDA time avg | CPU Mem | Self CPU Mem | CUDA Mem | Self CUDA Mem | # of Calls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DistributedDataParallel.forward | 1.58% | 82.714ms | 79.08% | 4.153s | 2.076s | 14.957ms | 0.29% | 4.239s | 2.120s | 184 b | -26.79 Mb | 2.86 Gb | -2.52 Gb | 2 |
| aten::linear | 0.60% | 31.512ms | 21.15% | 1.111s | 7.506ms | 2.990ms | 0.06% | 1.270s | 8.580ms | 0 b | 0 b | 879.35 Mb | 0 b | 148 |
| aten::addmm | 0.20% | 10.647ms | 20.40% | 1.071s | 7.238ms | 1.262s | 24.35% | 1.263s | 8.535ms | 0 b | 0 b | 879.35 Mb | 731.35 Mb | 148 |
| cudaFree | 20.00% | 1.050s | 20.00% | 1.050s | 350.069ms | 0.000us | 0.00% | 0.000us | 0.000us | 0 b | 0 b | 0 b | 0 b | 3 |
| aten::div | 8.93% | 468.917ms | 9.13% | 479.685ms | 1.057ms | 483.799ms | 9.33% | 483.984ms | 1.066ms | 1.95 Mb | 1.95 Mb | 1.94 Gb | 1.94 Gb | 454 |
| aten::add | 7.38% | 387.274ms | 7.40% | 388.333ms | 1.961ms | 405.858ms | 7.83% | 405.858ms | 2.050ms | 3.89 Mb | 3.89 Mb | 1.65 Gb | 1.65 Gb | 198 |
| aten::abs | 3.54% | 185.689ms | 7.07% | 371.299ms | 92.825ms | 185.689ms | 3.58% | 371.333ms | 92.833ms | 3.89 Mb | 1.95 Mb | 0 b | 0 b | 4 |
| aten::mul | 5.72% | 300.190ms | 6.98% | 366.603ms | 844.707us | 313.915ms | 6.06% | 314.000ms | 723.502us | 2.92 Mb | 2.92 Mb | 575.64 Mb | 575.64 Mb | 434 |
| autograd::engine::evaluate_function: torch::autograd... | 1.09% | 57.130ms | 6.61% | 347.215ms | 872.399us | 11.044ms | 0.21% | 140.752ms | 353.648us | 0 b | 0 b | -421.77 Mb | -421.77 Mb | 398 |
| aten::lt | 5.33% | 279.649ms | 5.33% | 279.649ms | 69.912ms | 279.704ms | 5.40% | 279.704ms | 69.926ms | 498.04 Kb | 498.04 Kb | 0 b | 0 b | 4 |

Self CPU time total: 5.251s
Self CUDA time total: 5.183s

I would recommend profiling the workload with a visual profiler that creates a proper timeline view of the kernel launches and execution times. Nsight Systems is the one I’m always using, but the native PyTorch profiler should also be able to create an output viewable in the browser.
This will allow you to actually see how long each phase takes. An example can be found here.
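
Something along these lines should work with the native profiler (untested sketch; train_step and loader are placeholders for your training iteration and dataloader):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Record both CPU-side ops and CUDA kernels for a few iterations.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for step, batch in enumerate(loader):
        train_step(batch)
        if step >= 4:
            break

# Writes a timeline you can open in chrome://tracing or Perfetto.
prof.export_chrome_trace("trace.json")
```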