I am sorry that I currently cannot create a minimal code sample to reproduce the problem I am facing.
Clues I can provide:
As can be seen from the screenshot, htop shows that four CPU cores are dominated by red, which (according to htop's F1 help) means they are spending their time in kernel mode.
My dataloader uses four workers, so the four red cores are exactly the ones belonging to my task.
Whenever this happens, my training code gets extremely slow and GPU utilization drops dramatically (most of the time to zero). In the figure below, I am using device 0.
What I currently know:
- If a process is stuck in kernel mode, it is likely performing IO. So my first guess was that the dataloader is waiting for data from disk.
a) Is too much data waiting to be loaded from disk?
I checked the IO speed using a series of tools such as iotop and hdparm.
- (Disproof) The reported disk read speed is around 20-30 MB/s, which would be fine even for an HDD, let alone the SSD that stores my dataset.
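Besides disk-level tools, the stall can be confirmed from inside Python by timing how long each `next(batch)` call blocks. This is a minimal sketch (the dataset and sizes are placeholders, not the original code): long gaps between batches point at the workers, short gaps point elsewhere.

```python
import time
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class TinyDataset(Dataset):
    """Stand-in dataset; swap in the real one to measure the actual pipeline."""
    def __len__(self):
        return 64
    def __getitem__(self, idx):
        # The real code would do cv2.imread(...) / PIL.Image.open(...) here.
        return torch.from_numpy(np.random.rand(3, 32, 32).astype(np.float32))

def measure_fetch_times(loader):
    """Return the wall-clock wait for each batch; spikes mean stalled workers."""
    times = []
    t0 = time.perf_counter()
    for _ in loader:
        t1 = time.perf_counter()
        times.append(t1 - t0)
        t0 = t1
    return times

if __name__ == "__main__":
    loader = DataLoader(TinyDataset(), batch_size=8, num_workers=4)
    times = measure_fetch_times(loader)
    print(f"max fetch wait: {max(times) * 1000:.1f} ms")
```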
b) Is my disk broken?
(Proof) If my task stops reading from disk (replacing cv2.imread/PIL.Image.open with random numpy data) while all other code stays the same, training always runs fine. So the problem is related to IO.
(Disproof) While the training task was running, I tested the disk performance; IO bandwidth was around 300 MB/s, so I think the disk is fine. Other people's training tasks on the same server are also fine; they never hit this problem although we use the same disks.
(Disproof) If I run two training tasks in parallel, the problem almost always happens, even when the two tasks load data from different disks (both SSDs, one NVMe + one SATA). In that case both tasks slow down dramatically although they are reading from different disks. As soon as I stop one task, the other returns to normal training speed with high GPU utilization. I cannot believe that all three disks (HDD + SATA SSD + NVMe SSD) on my server are broken, so the disks are probably not to blame; some other shared hardware may be. RAM? PCIe? I do not know.
(Disproof) The problem does not always happen. If I restart the training task, there is some probability that it disappears and everything goes fine.
If the problem shows up at the beginning of the task, it rarely disappears later.
If the problem does not show up at the beginning, it almost never appears later.
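The dummy-data experiment from b) can be sketched like this (names and sizes are illustrative, not the original code); flipping one flag switches between the real disk read and synthetic pixels while the rest of the pipeline stays identical:

```python
import numpy as np
from torch.utils.data import Dataset

USE_FAKE_DATA = True  # flip to compare disk IO vs. synthetic data

class ImageDataset(Dataset):
    """Illustrative dataset with the disk read made swappable."""
    def __init__(self, paths, size=(224, 224)):
        self.paths = paths
        self.size = size
    def __len__(self):
        return len(self.paths)
    def __getitem__(self, idx):
        if USE_FAKE_DATA:
            # Skip the disk entirely: random pixels of the expected shape.
            img = np.random.randint(0, 256, (*self.size, 3), dtype=np.uint8)
        else:
            import cv2
            img = cv2.imread(self.paths[idx])  # the suspected IO path
        # ... identical augmentation / tensor conversion follows ...
        return img.astype(np.float32) / 255.0
```

If training is always fast with `USE_FAKE_DATA = True` and only stalls with it off, the slowdown is tied to the read path rather than the model or the augmentations.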
c) Is something wrong with OpenCV?
- (Disproof) I tried both OpenCV and PIL.Image; the problem is the same either way. In the OpenCV case I also tested some additional workaround code (not shown here), which did not help.
d) Is something wrong with a certain version of torch?
- (Disproof) The problem occurs on versions 1.7.0, 1.7.1, and 1.8.0.
e) Is something wrong with another third-party library?
- I have not found a clue.
f) Am I short of memory?
- The server has 64 GB of RAM, and I see neither OOM errors nor swap usage.
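To rule out memory pressure while the stall is actually happening (rather than only at the start), a quick snapshot can be logged alongside training. This is a small sketch using `psutil` (assuming it is installed; the function name is mine):

```python
import psutil

def memory_snapshot():
    """Return RAM and swap usage so it can be logged next to training metrics."""
    vm = psutil.virtual_memory()
    sw = psutil.swap_memory()
    return {
        "ram_used_pct": vm.percent,
        "swap_used_pct": sw.percent,
        "available_gb": vm.available / 1e9,
    }
```

If `swap_used_pct` climbs exactly when the red kernel-mode cores appear, the slowdown would be paging, not disk reads.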
I have searched for a long time and tried all the suggestions I found.
What should I do next to debug this problem?
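One concrete next step (my suggestion, not from the post): make each worker dump its Python stack on demand, so that when the stall happens you can see exactly which call the worker is blocked in. The stdlib `faulthandler` module can register a signal for this; pass the function below as `DataLoader(worker_init_fn=...)`, then run `kill -USR1 <worker pid>` on a stuck worker (Linux only). If the traceback ends inside `cv2.imread` or a low-level read, that pins the kernel-mode time to a specific call.

```python
import faulthandler
import os
import signal

def worker_init_fn(worker_id):
    """Pass to DataLoader(worker_init_fn=...). After this, sending SIGUSR1 to
    a worker makes it print its current Python stack to stderr without dying,
    revealing the exact call it is blocked in."""
    faulthandler.register(signal.SIGUSR1, all_threads=True)
    print(f"worker {worker_id} pid={os.getpid()} ready for SIGUSR1 stack dumps")
```

An external profiler such as `py-spy dump --pid <worker pid>` gives the same information without modifying the code, and `strace -c -p <worker pid>` would show which syscalls dominate the kernel-mode time.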