How can I solve this IO problem?

[image]

As you can see, I need to use this kind of data during my training process, so IO has been a huge problem.
I also use DDP, which makes it tougher to adjust my code.
Is there a way in PyTorch to handle this problem? Help me, please.

What kind of IO problem are you seeing?

Loading the data for training can take 10+ seconds, which leaves my GPU utilization at 0% most of the time.

You could try to increase the number of workers so that the loading latency might be hidden in the background while the model is training. Also, if you are using an older HDD, try to move the data to a faster SSD to speed up the loading time.
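For example, something like this (a rough sketch; the dataset class, file pattern, and batch size are placeholders):

```python
# Rough sketch of hiding loading latency behind training with DataLoader workers.
# NpyFolderDataset and the "data/*.npy" pattern are placeholders for your setup.
import glob
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class NpyFolderDataset(Dataset):
    def __init__(self, pattern="data/*.npy"):
        self.files = sorted(glob.glob(pattern))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        return torch.from_numpy(np.load(self.files[idx]))

loader = DataLoader(
    NpyFolderDataset(),
    batch_size=64,
    num_workers=4,            # workers prefetch the next batches in the background
    pin_memory=True,          # faster host-to-GPU transfers
    persistent_workers=True,  # keep workers alive between epochs
    prefetch_factor=2,        # batches queued per worker
    # (with DDP you would additionally pass a DistributedSampler via sampler=...)
)
```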

I have already tried increasing the number of workers, and I found 4 works best for my code.
The training data is about 400 GB, and I don’t have an SSD that big.
But my HDD should be fast enough, almost 6 GB/s.
[image]

I don’t think your HDD can achieve this bandwidth; based on e.g. these newegg specs you could get ~255 MB/s:

this 3.5-inch drive features 8TB capacity, 7200 RPM spinning speed, SATA 6 Gbps host interface and 256 MB cache, and delivers a sustained transfer rate of up to 255 MB/s (for reference only)

You could also profile the read speed to check the “for reference only” claim and might see lower performance.
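For example, something like this gives a rough sequential read number (a minimal sketch; the path is a placeholder, and the file should ideally be larger than your RAM so the OS page cache doesn’t skew the result):

```python
# Measure raw sequential read throughput of the drive.
import time

path = "/mnt/hdd/some_large_file.bin"   # placeholder path on the HDD
chunk = 64 * 1024 * 1024                # read in 64 MB chunks
total = 0
start = time.perf_counter()
with open(path, "rb", buffering=0) as f:
    while True:
        data = f.read(chunk)
        if not data:
            break
        total += len(data)
elapsed = time.perf_counter() - start
print(f"{total / elapsed / 1e6:.1f} MB/s over {total / 1e9:.2f} GB")
```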

I think the main problem might not be the speed of my HDD but the way I load files.
Every training step I have to load 64 files of about 1.7 MB each.
Loading that many small files might cause a CPU bottleneck.

Is there any way I can build one big file and store the data in it?
Loading one big file should be faster than loading a bunch of separate files, even if each one is not that small.

Yes, you could load multiple numpy files in a script, concatenate the arrays (assuming their shape allows it), and save the newly created (bigger) numpy array again.
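For example (a rough sketch, assuming all arrays share the same shape along every axis except the first; the glob pattern and output path are placeholders):

```python
# Merge many small .npy files into one big array, then read from it via memory mapping.
import glob
import numpy as np

files = sorted(glob.glob("data/*.npy"))
# note: this loads everything into memory at once, so do it in chunks if RAM is tight
merged = np.concatenate([np.load(f) for f in files], axis=0)
np.save("data_merged.npy", merged)

# Later, memory-map the single big file and copy out individual samples:
data = np.load("data_merged.npy", mmap_mode="r")
sample = np.array(data[0])
```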

Maybe LMDB is a feasible way to solve my problem. What do you think?
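Something along these lines is roughly what I’m imagining (just a rough sketch; the paths, shapes, and dtype are placeholders):

```python
# Write each sample once into a single LMDB database, then read by key during training.
import lmdb
import numpy as np

env = lmdb.open("train.lmdb", map_size=500 * 1024**3)  # upper bound on database size

# One-time conversion: store each sample under a fixed-width key.
with env.begin(write=True) as txn:
    for i in range(10):                                 # placeholder loop over samples
        sample = np.random.rand(64, 64).astype(np.float32)
        txn.put(f"{i:08d}".encode(), sample.tobytes())

# Inside __getitem__: a single key lookup instead of opening a small file.
with env.begin() as txn:
    buf = txn.get(f"{0:08d}".encode())
sample = np.frombuffer(buf, dtype=np.float32).reshape(64, 64)
```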