PyTorch Geometric dataset loading is slow

Edan_Patt · August 2, 2022, 2:16pm

Hello everyone,
I created my own large dataset. Every time I want to run or train anything, the dataset has to be processed. I tried my best to optimize this process, but after profiling my run I found that the function
“files_exist” (torch_geometric.data.dataset — pytorch_geometric 1.4.3 documentation)
is taking up 87% of my runtime through its calls to posix.stat. Is there any clear-cut way to optimize this part of the function call? Or is there anything I can do with the way my files are written or ordered to help with this?

Thanks

ejguan · August 5, 2022, 4:58pm

Based on the implementation, files_exist only loops through the files and figures out whether any file is missing from your file system.
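
For illustration, here is a minimal sketch of that kind of check, along with one possible workaround. It is modelled on the description above rather than copied from the PyG source. The names files_exist_per_file and files_exist_by_listing are hypothetical and are not part of any library. The first function does one stat call per path, which would be consistent with posix.stat dominating the profile. The second lists each directory once and compares names, which may be cheaper when many files sit in a few directories. That has not been measured here.

```python
import os
import os.path as osp
import tempfile


def files_exist_per_file(paths):
    """Hypothetical stand-in for a per-file check: one os.stat call per path."""
    return len(paths) != 0 and all(osp.exists(p) for p in paths)


def files_exist_by_listing(paths):
    """Hypothetical alternative: list each parent directory once, then compare names."""
    # Group the expected file names by their parent directory.
    names_by_dir = {}
    for p in paths:
        names_by_dir.setdefault(osp.dirname(p) or ".", set()).add(osp.basename(p))
    if not names_by_dir:
        return False  # match the per-file version: an empty list counts as missing
    for directory, expected in names_by_dir.items():
        try:
            with os.scandir(directory) as entries:
                present = {entry.name for entry in entries}
        except OSError:
            return False  # directory missing or unreadable
        if not expected <= present:
            return False  # at least one expected file is absent
    return True


# Small self-check on a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as root:
    paths = [osp.join(root, f"data_{i}.pt") for i in range(100)]
    for p in paths:
        open(p, "wb").close()
    print(files_exist_per_file(paths), files_exist_by_listing(paths))
```

If the dataset class can be changed, writing fewer and larger processed files would also reduce the number of paths to check, whichever function is used.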

Do you want to try torchdata? We are advocating a streaming approach to loading data, so you can download your data on the fly rather than downloading it beforehand. Here is the doc.