I saw a very strange behavior during training. The training slows down after a few steps (generally after about 50% of the steps in the first epoch). At that point the GPU utilization of 6 or 7 out of 8 GPUs goes up from 90% to 100%, while the remaining 1 or 2 GPUs drop to 0% utilization. No matter what training configuration I set, I see the same behavior. As a result, one epoch takes approximately 4 hours instead of less than 10 minutes.
Environment:
Python Version: 3.6.9 [GCC 8.4.0]
PyTorch Version: 1.8.1+cu102
Number of GPUs: 8
I have no idea where the bottleneck is. Has anyone faced a similar problem?
Assuming you are using DistributedDataParallel, the 100% utilization could point towards the GPUs waiting in NCCL calls for other devices to be ready. In that case the training steps of the devices seem to diverge, and you could profile it with e.g. Nsight Systems. Since this effect seems to be triggered after some time, you could also check the GPU temperatures and make sure that no devices are downclocked due to excessive heat.
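For the temperature check, a minimal monitoring sketch along these lines could log per-device temperature, utilization, and SM clock during training. This assumes the NVML Python bindings (e.g. the nvidia-ml-py3 package) are installed and is not part of the original training script:

```python
# Hedged monitoring sketch, assuming the NVML Python bindings are installed
# (e.g. `pip install nvidia-ml-py3`). Not taken from the original script.
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

def log_gpu_stats(step):
    # Print temperature, utilization, and SM clock for every visible GPU.
    for i, h in enumerate(handles):
        temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
        util = pynvml.nvmlDeviceGetUtilizationRates(h).gpu
        clock = pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_SM)
        print(f"step {step} | GPU {i}: {temp} C, {util}% util, SM clock {clock} MHz")

# Calling log_gpu_stats(step) every few hundred training steps should show
# whether a device heats up and downclocks right before the slowdown starts.
```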
Yes, I am talking about DistributedDataParallel, but the strange thing is that it happens even without DistributedDataParallel, i.e., when using a single GPU. This means that the problem doesn't come from NCCL calls (correct me if I am wrong).
In the single-GPU case, after some 30,000 to 40,000 samples have been processed, the training gets stuck: the GPU utilization drops from 95% to 0%, while the CPU utilization fluctuates between its minimum and 100%.
The GPU temperature stays around 70 °C at 95% usage (during smooth training), which I think is normal, and is around 55 °C at 0% usage (when the problem occurs).
When this problem occurs, it also affects other processes on the machine. For example, I am processing the VoxCeleb dataset, and that processing slows to a crawl when the problem occurs.
I have also tried different versions of PyTorch, i.e., 1.6.0, 1.7.0, 1.8.1, and the latest 1.9.0. All behave the same.
I have also checked on different machines with different GPUs (Machine 1 with a Tesla T4, Machine 2 with a Titan RTX) and faced the same problem.
By the way, I am trying to train this model, and other people do not seem to have this problem with it. I have no idea why it happens in my case.
Thanks for the update.
Based on the description it seems that something might indeed be throttling your system.
Often the data loading might be a bottleneck and would let your GPU starve. However, since you are seeing the slowdown for all processes, I guess it's affecting the entire system, not only your PyTorch training script.
I still think a proper profile would be helpful to see when the performance drops and which calls are executed. You might also consider stress testing your system to see if it throttles for specific workloads (e.g. reduces the CPU frequency).
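The simplest option is probably `python -m torch.utils.bottleneck your_script.py`; alternatively, a short torch.profiler run (available since PyTorch 1.8) around a handful of steps would also work. A minimal, self-contained sketch with a toy model, where the real training step would go in its place:

```python
# Hedged sketch of collecting a per-op profile with torch.profiler (PyTorch 1.8+).
# The tiny model and random data below are placeholders for the real training step.
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

model = nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = torch.randn(64, 512, device="cuda")
target = torch.randint(0, 10, (64,), device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(20):                      # profile a limited number of steps
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(data), target)
        loss.backward()
        optimizer.step()

# Comparing a profile taken early in the epoch with one taken after the slowdown
# should show which calls become more expensive.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```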
@ptrblck I tried torch.utils.bottleneck and the following is the output. Note: I ran the script for 1 epoch only, as you suggested here, and this output was printed before the training started (i.e., before the problem occurred).
--------------------------------------------------------------------------------
Environment Summary
--------------------------------------------------------------------------------
PyTorch 1.9.0+cu111 DEBUG compiled w/ CUDA 11.1
Running with Python 3.6 and
`pip3 list` truncated output:
numpy==1.18.1
torch==1.9.0+cu111
torchaudio==0.9.0
torchvision==0.10.0+cu111
--------------------------------------------------------------------------------
cProfile output
--------------------------------------------------------------------------------
15127188 function calls (15111211 primitive calls) in 8.870 seconds
Ordered by: internal time
List reduced from 2747 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
289 3.666 0.013 3.666 0.013 {method 'cuda' of 'torch._C._TensorBase' objects}
1263775 1.174 0.000 1.905 0.000 /usr/lib/python3.6/posixpath.py:75(join)
1 1.069 1.069 4.304 4.304 /DatasetLoader.py:114(__init__)
2442625 0.395 0.000 0.395 0.000 {method 'split' of 'str' objects}
1264426 0.223 0.000 0.365 0.000 /usr/lib/python3.6/posixpath.py:41(_get_sep)
2560595 0.171 0.000 0.171 0.000 {method 'append' of 'list' objects}
1304392/1303795 0.149 0.000 0.155 0.000 {built-in method builtins.isinstance}
173823 0.144 0.000 0.151 0.000 /usr/lib/python3.6/glob.py:114(_iterdir)
1265133 0.128 0.000 0.128 0.000 {method 'endswith' of 'str' objects}
1270598 0.127 0.000 0.127 0.000 {method 'startswith' of 'str' objects}
1 0.117 0.117 0.288 0.288 /DatasetLoader.py:129(<listcomp>)
1268919 0.113 0.000 0.113 0.000 {built-in method posix.fspath}
2 0.110 0.055 0.115 0.057 {method 'readlines' of '_io._IOBase' objects}
1092599 0.091 0.000 0.091 0.000 {method 'strip' of 'str' objects}
1 0.066 0.066 0.775 0.775 /DatasetLoader.py:62(__init__)
--------------------------------------------------------------------------------
autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
------------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
------------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::normal_ 9.89% 1.233ms 9.89% 1.233ms 1.233ms 1
aten::normal_ 9.61% 1.198ms 9.61% 1.198ms 1.198ms 1
aten::normal_ 9.54% 1.190ms 9.54% 1.190ms 1.190ms 1
aten::normal_ 9.48% 1.182ms 9.48% 1.182ms 1.182ms 1
aten::normal_ 9.42% 1.174ms 9.42% 1.174ms 1.174ms 1
aten::uniform_ 6.81% 848.913us 6.81% 848.913us 848.913us 1
aten::uniform_ 6.35% 791.197us 6.35% 791.197us 791.197us 1
aten::uniform_ 6.26% 780.781us 6.26% 780.781us 780.781us 1
aten::uniform_ 5.96% 743.103us 5.96% 743.103us 743.103us 1
aten::uniform_ 5.94% 740.487us 5.94% 740.487us 740.487us 1
aten::uniform_ 5.64% 703.088us 5.64% 703.088us 703.088us 1
aten::normal_ 5.29% 660.099us 5.29% 660.099us 660.099us 1
aten::max 3.30% 411.601us 3.43% 427.419us 427.419us 1
aten::normal_ 3.29% 410.359us 3.29% 410.359us 410.359us 1
aten::normal_ 3.21% 400.519us 3.21% 400.519us 400.519us 1
------------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 12.467ms
--------------------------------------------------------------------------------
autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
Because the autograd profiler uses the CUDA event API,
the CUDA time column reports approximately max(cuda_time, cpu_time).
Please ignore this output if your code does not use CUDA.
------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::normal_ 9.13% 1.198ms 9.13% 1.198ms 1.198ms 1.198ms 9.13% 1.198ms 1.198ms 1
aten::normal_ 8.93% 1.172ms 8.93% 1.172ms 1.172ms 1.173ms 8.94% 1.173ms 1.173ms 1
aten::normal_ 8.93% 1.172ms 8.93% 1.172ms 1.172ms 1.171ms 8.93% 1.171ms 1.171ms 1
aten::normal_ 8.86% 1.163ms 8.86% 1.163ms 1.163ms 1.163ms 8.86% 1.163ms 1.163ms 1
aten::normal_ 8.85% 1.161ms 8.85% 1.161ms 1.161ms 1.162ms 8.85% 1.162ms 1.162ms 1
aten::uniform_ 7.56% 992.395us 7.56% 992.395us 992.395us 991.776us 7.56% 991.776us 991.776us 1
aten::uniform_ 7.36% 966.001us 7.36% 966.001us 966.001us 966.240us 7.36% 966.240us 966.240us 1
aten::uniform_ 7.30% 958.179us 7.30% 958.179us 958.179us 957.120us 7.29% 957.120us 957.120us 1
aten::uniform_ 7.29% 956.692us 7.29% 956.692us 956.692us 957.760us 7.30% 957.760us 957.760us 1
aten::uniform_ 7.10% 932.259us 7.10% 932.259us 932.259us 929.216us 7.08% 929.216us 929.216us 1
aten::uniform_ 5.43% 711.949us 5.43% 711.949us 711.949us 712.384us 5.43% 712.384us 712.384us 1
aten::normal_ 4.34% 569.383us 4.34% 569.383us 569.383us 569.248us 4.34% 569.248us 569.248us 1
aten::uniform_ 3.49% 457.421us 3.49% 457.421us 457.421us 457.216us 3.48% 457.216us 457.216us 1
aten::max 2.48% 324.992us 2.95% 387.751us 387.751us 328.032us 2.50% 387.328us 387.328us 1
aten::normal_ 2.95% 386.638us 2.95% 386.638us 386.638us 387.072us 2.95% 387.072us 387.072us 1
------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 13.123ms
Self CUDA time total: 13.123ms
@ptrblck On a single GPU, without DDP, I can achieve a speed of 500 Hz (samples processed per second) and 95% GPU utilization. However, this suddenly drops to 50-60 Hz after half an epoch, and the GPU utilization fluctuates between 0% and 50%; most of the time it stays at 0%. Any idea why this could happen?
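A per-interval counter along the following lines is one way to track such a samples/sec figure and pin down exactly when the drop starts; `loader` and `train_step` are placeholders, not the actual training code:

```python
# Hedged sketch of per-interval throughput logging; `loader` and `train_step`
# stand in for the real DataLoader and training step.
import time

seen, t0 = 0, time.perf_counter()
for step, (inputs, labels) in enumerate(loader):
    train_step(inputs, labels)          # placeholder: forward/backward/optimizer
    seen += inputs.size(0)
    if step % 100 == 0 and step > 0:
        elapsed = time.perf_counter() - t0
        print(f"step {step}: {seen / elapsed:.0f} samples/s")
        seen, t0 = 0, time.perf_counter()
```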
I’m unsure what the root cause is, but I would still try to investigate the system health and check if resources are throttled. As said before, CPU as well as GPU benchmark software could be useful to check for consistent performance.
I also don’t think it’s a problem with the DataLoader, but rather a system issue, and so far it seems you weren’t able to run any stress tests on the system.
I have noticed another thing. I am using num_workers = 5 in the DataLoader, so htop shows 6 processes in total for this job (5 DataLoader workers plus 1 main process). When the training gets stuck (slows down), 5 out of the 6 processes go into the D state (uninterruptible sleep) and never leave it again, whereas in the beginning (normal training) they keep alternating between the D and R states.
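One way to quantify this from the training script is to time how long the main process waits for each batch versus how long the step computation takes. A hedged sketch, where `loader` and `train_step` are placeholders for the real objects:

```python
# Hedged sketch: measure DataLoader wait time vs. step compute time.
# `loader` and `train_step` are placeholders for the real objects.
import time
import torch

data_wait, compute = 0.0, 0.0
t0 = time.perf_counter()
for step, batch in enumerate(loader):
    t1 = time.perf_counter()
    data_wait += t1 - t0                 # time spent waiting on the workers
    train_step(batch)                    # placeholder forward/backward/step
    torch.cuda.synchronize()             # make GPU time visible to the wall clock
    t0 = time.perf_counter()
    compute += t0 - t1
    if step % 100 == 0 and step > 0:
        print(f"step {step}: data wait {data_wait:.1f}s, compute {compute:.1f}s")
        data_wait, compute = 0.0, 0.0
```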
The D state is usually triggered by I/O operations, so it could point towards slow data loading.
You could check the I/O throughput of the system and see if it drops after a while.
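A crude read-throughput check from Python could look like the sketch below; the dataset path and file extension are placeholders, and dedicated tools such as fio or iostat would give a more reliable picture:

```python
# Hedged sketch of a crude sequential-read throughput check; the path and
# file extension are placeholders for the actual dataset location.
import glob
import time

files = glob.glob("/path/to/dataset/**/*.wav", recursive=True)[:2000]

start, total = time.perf_counter(), 0
for f in files:
    with open(f, "rb") as fh:
        total += len(fh.read())
elapsed = time.perf_counter() - start
print(f"read {total / 1e6:.0f} MB in {elapsed:.1f} s "
      f"({total / 1e6 / elapsed:.1f} MB/s)")
```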
@ptrblck Well, I moved the data from the HDD to an SSD and that solved my problem. I didn’t know an HDD could be such a bottleneck. Anyway, on the SSD the training is now very smooth, as expected, both in single- and multi-GPU setups.
You could try to force sequential access of the files, since it would be faster than random access, but in general the data loading will most likely stay the bottleneck.
I guess you could defragment the HDD, store the files sequentially in the desired order, and disable shuffling.
However, I would still assume that the performance would be worse than with the data copied to the SSD.
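For illustration, the "disable shuffling" part of that suggestion could look like the sketch below; MyAudioDataset and the glob pattern are placeholders, and with the data already on the SSD this is probably unnecessary:

```python
# Hedged sketch: keep the sample list in on-disk order and disable shuffling so
# an HDD is read mostly sequentially. MyAudioDataset and the glob pattern are
# placeholders, not part of the original code.
import glob
from torch.utils.data import DataLoader

file_list = sorted(glob.glob("/path/to/dataset/**/*.wav", recursive=True))
dataset = MyAudioDataset(file_list)      # placeholder Dataset implementation
loader = DataLoader(dataset,
                    batch_size=64,
                    shuffle=False,       # sequential sampler -> sequential reads
                    num_workers=5,
                    pin_memory=True)
```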