Strange behavior in Pytorch

I saw a very strange behavior during training. The Training slows down after a few steps (generally after 50% steps in the first epoch). At this point the GPU utilization (of 6 or 7 out of 8 GPUs) goes to 100% from 90%. While the remaining 1 or 2 of the GPUs go down to 0% utilization. No matter what training configuration i set, i see the same behavior. In this way i have one epoch in approximately 4 hours, instead of less than 10 minutes.

Python Version: 3.6.9
[GCC 8.4.0]
PyTorch Version: 1.8.1+cu102
Number of GPUs: 8

I have no idea where is the bottleneck? Does anyone face similar problem?

1 Like

Assuming you are using DistributedDataParallel the 100% util. could point towards the GPUs waiting in NCCL calls for other devices to be ready. In that case the training steps of all devices seem to diverge and you could profile it with e.g. Nsight Systems. Since this effect seems to be triggered after some time, you could also check the GPU temperatures and make sure that no devices are downclocked due to excessive heat etc.

@ptrblck Thanks for your reply.

  • Yes i am talking about DistributedDataParallel, but the strange thing is that it happens even without DistributedDataParallel i.e., using a single GPU. This means that the problem doesn’t come from NCCL calls (correct me if i am wrong).
  • In case of single GPU after some 30000 to 40000 samples being processed, the training gets stuck, and the GPU utilization goes to 0% from 95%, whereas the CPU utilization fluctuates between minimum and 100%.
  • The temperature of the GPU remains around 70C at 95% usage (at smooth training), which is normal, i think, and is around 55C at 0% usage (when the problem occurs).
  • When this problem occurs, it also effects other processes on the machine. For example, i am processing the VoxCeleb dataset, which slows down to minimum when this problem occurs.
  • Also, i have tried with different versions of PyTorch, i.e., 1.6.0, 1.7.0, 1.8.1, and the latest 1.9.0. All behave the same.
  • Also, i have checked on different machines, with different GPUs (Machine 1 with Tesla T4, and Machine 2 with Titan RTX), i faced the same problem.
  • By the way i am trying to train this model, where other people do not have this problem. I have no idea why it happens in my case?

Thanks for the update.
Based on the description it seems that something might indeed be throttling your system.
Often the data loading might be a bottleneck and would let your GPU starve. However, since you are seeing the slowdown for all processes, I guess it’s affecting the entire system not only your PyTorch training script.
I still think a proper profile would be helpful to see, when the performance drops and which calls are executed. Also, you might consider stress testing your system and see, if it throttles for specific workloads (e.g. reduces the CPU freq. etc.).

@ptrblck I tried with torch.utils.bottleneck and following is the output:
Note: I ran the script for 1 epoch only, as you suggested here, and this is printed without starting the training (i.e., before the problem occurred).

  Environment Summary
PyTorch 1.9.0+cu111 DEBUG compiled w/ CUDA 11.1
Running with Python 3.6 and 

`pip3 list` truncated output:
  cProfile output
         15127188 function calls (15111211 primitive calls) in 8.870 seconds

   Ordered by: internal time
   List reduced from 2747 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      289    3.666    0.013    3.666    0.013 {method 'cuda' of 'torch._C._TensorBase' objects}
  1263775    1.174    0.000    1.905    0.000 /usr/lib/python3.6/
        1    1.069    1.069    4.304    4.304 /
  2442625    0.395    0.000    0.395    0.000 {method 'split' of 'str' objects}
  1264426    0.223    0.000    0.365    0.000 /usr/lib/python3.6/
  2560595    0.171    0.000    0.171    0.000 {method 'append' of 'list' objects}
1304392/1303795    0.149    0.000    0.155    0.000 {built-in method builtins.isinstance}
   173823    0.144    0.000    0.151    0.000 /usr/lib/python3.6/
  1265133    0.128    0.000    0.128    0.000 {method 'endswith' of 'str' objects}
  1270598    0.127    0.000    0.127    0.000 {method 'startswith' of 'str' objects}
        1    0.117    0.117    0.288    0.288 /<listcomp>)
  1268919    0.113    0.000    0.113    0.000 {built-in method posix.fspath}
        2    0.110    0.055    0.115    0.057 {method 'readlines' of '_io._IOBase' objects}
  1092599    0.091    0.000    0.091    0.000 {method 'strip' of 'str' objects}
        1    0.066    0.066    0.775    0.775 /

  autograd profiler output (CPU mode)
        top 15 events sorted by cpu_time_total

------------------  ------------  ------------  ------------  ------------  ------------  ------------  
              Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
------------------  ------------  ------------  ------------  ------------  ------------  ------------  
     aten::normal_         9.89%       1.233ms         9.89%       1.233ms       1.233ms             1  
     aten::normal_         9.61%       1.198ms         9.61%       1.198ms       1.198ms             1  
     aten::normal_         9.54%       1.190ms         9.54%       1.190ms       1.190ms             1  
     aten::normal_         9.48%       1.182ms         9.48%       1.182ms       1.182ms             1  
     aten::normal_         9.42%       1.174ms         9.42%       1.174ms       1.174ms             1  
    aten::uniform_         6.81%     848.913us         6.81%     848.913us     848.913us             1  
    aten::uniform_         6.35%     791.197us         6.35%     791.197us     791.197us             1  
    aten::uniform_         6.26%     780.781us         6.26%     780.781us     780.781us             1  
    aten::uniform_         5.96%     743.103us         5.96%     743.103us     743.103us             1  
    aten::uniform_         5.94%     740.487us         5.94%     740.487us     740.487us             1  
    aten::uniform_         5.64%     703.088us         5.64%     703.088us     703.088us             1  
     aten::normal_         5.29%     660.099us         5.29%     660.099us     660.099us             1  
         aten::max         3.30%     411.601us         3.43%     427.419us     427.419us             1  
     aten::normal_         3.29%     410.359us         3.29%     410.359us     410.359us             1  
     aten::normal_         3.21%     400.519us         3.21%     400.519us     400.519us             1  
------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 12.467ms

  autograd profiler output (CUDA mode)
        top 15 events sorted by cpu_time_total

	Because the autograd profiler uses the CUDA event API,
	the CUDA time column reports approximately max(cuda_time, cpu_time).
	Please ignore this output if your code does not use CUDA.

------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
              Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
     aten::normal_         9.13%       1.198ms         9.13%       1.198ms       1.198ms       1.198ms         9.13%       1.198ms       1.198ms             1  
     aten::normal_         8.93%       1.172ms         8.93%       1.172ms       1.172ms       1.173ms         8.94%       1.173ms       1.173ms             1  
     aten::normal_         8.93%       1.172ms         8.93%       1.172ms       1.172ms       1.171ms         8.93%       1.171ms       1.171ms             1  
     aten::normal_         8.86%       1.163ms         8.86%       1.163ms       1.163ms       1.163ms         8.86%       1.163ms       1.163ms             1  
     aten::normal_         8.85%       1.161ms         8.85%       1.161ms       1.161ms       1.162ms         8.85%       1.162ms       1.162ms             1  
    aten::uniform_         7.56%     992.395us         7.56%     992.395us     992.395us     991.776us         7.56%     991.776us     991.776us             1  
    aten::uniform_         7.36%     966.001us         7.36%     966.001us     966.001us     966.240us         7.36%     966.240us     966.240us             1  
    aten::uniform_         7.30%     958.179us         7.30%     958.179us     958.179us     957.120us         7.29%     957.120us     957.120us             1  
    aten::uniform_         7.29%     956.692us         7.29%     956.692us     956.692us     957.760us         7.30%     957.760us     957.760us             1  
    aten::uniform_         7.10%     932.259us         7.10%     932.259us     932.259us     929.216us         7.08%     929.216us     929.216us             1  
    aten::uniform_         5.43%     711.949us         5.43%     711.949us     711.949us     712.384us         5.43%     712.384us     712.384us             1  
     aten::normal_         4.34%     569.383us         4.34%     569.383us     569.383us     569.248us         4.34%     569.248us     569.248us             1  
    aten::uniform_         3.49%     457.421us         3.49%     457.421us     457.421us     457.216us         3.48%     457.216us     457.216us             1  
         aten::max         2.48%     324.992us         2.95%     387.751us     387.751us     328.032us         2.50%     387.328us     387.328us             1  
     aten::normal_         2.95%     386.638us         2.95%     386.638us     386.638us     387.072us         2.95%     387.072us     387.072us             1  
------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 13.123ms
Self CUDA time total: 13.123ms

@ptrblck On single GPU, without DDP, i can achieve a speed of 500Hz (samples processed per second) and 95% GPU utilization. However, this suddenly drops to 50-60 Hz after half epoch, and GPU utilization fluctuates between 0% and 50%. Most of the time it remains 0%. Any idea why this could happen?

I’m unsure what the root cause is, but I would still try to investigate the system health and check, if resources are throttled. As said before, CPU as well as GPU benchmark software could be useful to check for consistent performance.

@ptrblck Apart from everything, the training goes very smooth (as expected) on the following GPUs (Single GPU training):

  • GeForce RTX 2080 (16 CPU cores)
  • Titan XP (48 CPU cores)

In these cases, the CPU usage does not go to 100%.

But it stucks on:

  • Tesla T4 (96 CPU cores)
  • Titan RTX (96 CPU cores)

In these cases, the CPU usage go to 100% for 2-3 cores and the rest are at bare minimum.

So i think its not a problem of DataLoader.

I also don’t think it’s a problem of the DataLoader, but a system issue and so far it seems you weren’t able to run any stress tests for the system.

@ptrblck No i didn’t run stress test yet.

I have noticed another thing. I am using num_workers = 5 in the DataLoader. So htop shows 6 processes in total for this job, (which means 5 for DataLoader workers plus 1 main). When the training stucks (slows down), 5 out of 6 processes go to D state (i.e., uninterruptible sleep) and they can’t go out of this state anymore. Whereas in the beginning (normal training) they keep fluctuating between D and R states.

The D state is usually triggered by I/O operations, so it could point towards a low data loading etc.
You could check the I/O throughput for the system and check, if it would drop after a while.

@ptrblck Well, i moved the data from HDD to SSD and it solved my problem. I didn’t know HDD can be such a bottleneck. Anyway, now on SSD the training is very smooth as expected, both on single- and multi-GPU devices.

@ptrblck While moving my data to SSD solved my problem, i wonder how to deal with this problem in case SSD is not available at all?

You could try to force a sequential access of the files, since it would be faster than a random access, but in general the data loading will most likely stay the bottleneck.

@ptrblck How to force a sequential access of the files?

1 Like

I guess you could defragment the HDD, store the files sequentially in the desired order, and disable shuffling.
However, I would still assume that your performance would be worse than copying the data to the SSD.

Hi, I encounter a similar problem. However, my data is already on SSD. I have no idea about how to make the program run normally.

I solved this problem by using single GPU, instead of multiple GPUs. That is quite wired.