Fastest way to read images from uncompressed TAR file in __getitem__ method of Custom Dataset

Hello everyone
I have a huge dataset of 2 million JPG images in a single uncompressed TAR file. I also have a txt file where each line is the name of an image in the TAR file, in order:

img_0000001.jpg
img_0000002.jpg
img_0000003.jpg
...

and the images in the TAR file are stored in exactly the same order.
I searched a lot and found that the tarfile module is the best option, but when I tried to read images from the TAR file by name, it took too long. The reason is that every time I call the getmember(name) method, it calls getmembers(), which scans the whole TAR file, builds a namespace of all names, and then searches for the name in that namespace.

If it helps, my dataset is a single 20 GB TAR file.

I don’t know whether it is a better idea to extract everything first and use the extracted folders in my CustomDataset, or to read directly from the archive.

Here is the code I am using to read a single file from the TAR file:

    import io
    import tarfile
    from PIL import Image

    # Open the archive, look up the member by name, and decode the bytes with PIL
    with tarfile.open('data.tar') as tf:
        tarinfo = tf.getmember('img_0000001.jpg')
        image = tf.extractfile(tarinfo)
        image = image.read()
        image = Image.open(io.BytesIO(image))

I use this code in the __getitem__ method of my CustomDataset class, which loops over all the names in filelist.txt.

Thanks for any advice

Based on your description, it seems better to extract all files beforehand and read them separately.
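
For reference, that one-time extraction could be as simple as the sketch below (the data.tar and data_extracted paths are just placeholders, not from your post); afterwards __getitem__ only has to Image.open the individual files on disk.

    import tarfile

    # Run once before training: unpack the archive so the Dataset can read
    # the individual image files from the extracted folder.
    with tarfile.open('data.tar') as tf:
        tf.extractall('data_extracted')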

Alternatively, would it be possible to call tf.getmembers() once in your __init__ method, or to use the filenames stored in your filelist.txt to get each file via tf.extractfile(self.filelist[index]) in __getitem__?
Based on the docs, extractfile should also accept a file name directly.
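
A rough sketch of that second approach (assuming a hypothetical TarDataset class, that the names in filelist.txt match the member names stored in the archive, and that lazily re-opening the archive per worker is acceptable) could look like this:

    import io
    import tarfile

    from PIL import Image
    from torch.utils.data import Dataset


    class TarDataset(Dataset):
        def __init__(self, tar_path, filelist_path):
            self.tar_path = tar_path
            # Ordered image names, one per line.
            with open(filelist_path) as f:
                self.filelist = [line.strip() for line in f if line.strip()]
            # Scan the archive a single time and keep a name -> TarInfo lookup,
            # so __getitem__ never triggers another full scan.
            with tarfile.open(tar_path) as tf:
                self.members = {m.name: m for m in tf.getmembers()}
            self.tf = None  # opened lazily, once per DataLoader worker

        def __len__(self):
            return len(self.filelist)

        def __getitem__(self, index):
            if self.tf is None:
                self.tf = tarfile.open(self.tar_path)
            member = self.members[self.filelist[index]]
            data = self.tf.extractfile(member).read()
            return Image.open(io.BytesIO(data)).convert('RGB')

Building the name -> TarInfo dict once avoids the repeated full scans that getmember(name) triggers, and opening the archive lazily keeps one file handle per worker process instead of sharing a single handle between them.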
