Handling a huge dataset: practical issues

I am currently attempting to train on the LAION-400M dataset, but I am running into numerous issues. The downloaded data is organized as follows:

  • 00000.parquet
  • 00000.tar
  • 00000_stats.json

There are approximately 40,000 such file triplets, one per shard.

My initial approach was to untar all of the shards, which produces per-sample triplets like:

  • 004137496.txt
  • 004137496.jpg
  • 004137496.json
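For reference, the extraction step is essentially the following (a sketch; `extract_shard` and the paths are just illustrative names, not part of any library):

```python
# Sketch of the untarring step: extract one shard into a target directory.
# Assumes shards are named 00000.tar .. 39999.tar as shown above.
import tarfile
from pathlib import Path

def extract_shard(tar_path: str, out_dir: str) -> int:
    """Extract every member of one .tar shard; return the member count."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    with tarfile.open(tar_path) as tf:
        for member in tf:
            tf.extract(member, path=out)
            count += 1
    return count
```

Running this over all ~40,000 shards is the step that produces the problems listed below.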

However, I have encountered several problems during this process:

  1. Untarring the files is a time-consuming operation.
  2. Untarring the files roughly doubles the disk space required, since the extracted copies sit next to the original tars.
  3. Handling a directory tree with over a billion files (400M samples × 3 files each) is extremely challenging.
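All three problems come from materializing every member on disk. One alternative I have been considering is iterating samples straight out of each shard without extracting it. A rough sketch (grouping by filename stem to match the triplet naming above; this assumes every sample has exactly one .jpg, .txt, and .json member):

```python
# Read (jpg, txt, json) triplets directly from a .tar shard, without
# ever writing the members to disk. Groups members by filename stem.
import json
import tarfile
from collections import defaultdict

def iter_samples(tar_path):
    """Yield (key, sample) pairs, where sample has 'jpg' bytes,
    decoded 'txt' caption, and parsed 'json' metadata."""
    groups = defaultdict(dict)
    with tarfile.open(tar_path) as tf:
        for member in tf:
            if not member.isfile():
                continue
            stem, _, ext = member.name.rpartition(".")
            groups[stem][ext] = tf.extractfile(member).read()
            if len(groups[stem]) == 3:  # jpg + txt + json all collected
                raw = groups.pop(stem)
                yield stem, {
                    "jpg": raw["jpg"],
                    "txt": raw["txt"].decode("utf-8"),
                    "json": json.loads(raw["json"]),
                }
```

I have not benchmarked this at scale, so I would still appreciate pointers to whatever the standard practice is.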

I am seeking advice on the common practices for working with datasets of this size. Has anyone worked with the LAION-400M dataset before? Any suggestions or recommendations would be welcome.

By the way, I also tried streaming the data through Hugging Face, but that turned out to be roughly 30× slower than reading local files.

Thank you for any assistance you can provide!