I cannot figure out what is wrong with my platform: processing images in Python is very slow on it. So I am thinking about implementing sample loading and preprocessing in C++, while still using PyTorch for the rest of the training loop.
Are there any examples of this? Or could you give me some ideas on how to implement it?
I think the problem is multiprocessing. I found that launching two identical training scripts at the same time makes training much slower, and I suspect this is because the machine is too weak to handle that many processes. Since the C++ frontend dataloader uses threads rather than processes, I expect it to put less load on the machine, so training should be much faster.
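To make the thread idea concrete, here is a minimal sketch of a thread-based prefetching loader that stays in a single Python process and uses only the standard library. It is not the C++ frontend dataloader, just an illustration of the same pattern. `load_and_preprocess` and `threaded_batches` are hypothetical names, and the sketch assumes the real decoding work happens in a library that releases the GIL (image libraries such as OpenCV typically do), otherwise the threads would not run in parallel.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def load_and_preprocess(index):
    # Hypothetical stand-in for decoding and augmenting one image.
    # Replace with real work, e.g. reading and resizing a file.
    return index * 2


def threaded_batches(num_samples, batch_size, num_threads=4, prefetch=2):
    """Yield batches in order, prepared by worker threads in one process."""
    out = queue.Queue(maxsize=prefetch)  # bounded, so the producer cannot run far ahead
    sentinel = object()

    def producer():
        try:
            with ThreadPoolExecutor(max_workers=num_threads) as pool:
                for start in range(0, num_samples, batch_size):
                    indices = range(start, min(start + batch_size, num_samples))
                    # pool.map keeps the samples of a batch in input order
                    out.put(list(pool.map(load_and_preprocess, indices)))
        finally:
            out.put(sentinel)  # always signal the end, even after an error

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = out.get()
        if batch is sentinel:
            return
        yield batch


batches = list(threaded_batches(num_samples=10, batch_size=4))
```

In a training loop, each yielded batch would still have to be converted to tensors (for example with `torch.stack`) before being passed to the model. Whether this is actually faster than a multi-process loader on this machine would need to be measured.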
By the way, I was recently deduplicating the images in a dataset and found that a Python process pool with 128 processes was almost 10x slower than a C++ program with 64 threads. It seems this machine prefers C++ to Python.
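For comparison, a thread pool can also be tried from Python for this kind of job. The following is only a sketch, assuming that deduplication means finding byte-identical files (the actual method used above is not stated). `file_digest` and `find_duplicates` are hypothetical helper names. It relies on `hashlib` releasing the GIL while hashing larger buffers, so the threads can overlap file reads and hashing without starting extra processes.

```python
import hashlib
import os
import tempfile
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def file_digest(path):
    # Hash the raw bytes; identical files produce identical digests.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def find_duplicates(paths, num_threads=8):
    """Group paths by content hash and return groups with more than one file."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        digests = list(pool.map(file_digest, paths))
    groups = defaultdict(list)
    for path, digest in zip(paths, digests):
        groups[digest].append(path)
    return [group for group in groups.values() if len(group) > 1]


# Tiny demo with throwaway files standing in for images.
with tempfile.TemporaryDirectory() as tmp:
    contents = {"a.jpg": b"cat", "b.jpg": b"dog", "c.jpg": b"cat"}
    demo_paths = []
    for name, data in contents.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as f:
            f.write(data)
        demo_paths.append(path)
    duplicate_names = [
        [os.path.basename(p) for p in group]
        for group in find_duplicates(demo_paths)
    ]
```

Timing this against the process-pool version on the same files would show whether process startup and inter-process transfer are what makes the Python version slow here.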