Fastest way to read images from uncompressed TAR file in __getitem__ method of Custom Dataset


(Mohammad Doosti Lakhani) #1

Hello everyone,
I have a huge dataset (2 million JPG images) in one uncompressed TAR file. I also have a txt file in which each line is the name of an image in the TAR file, in order:

img_0000001.jpg
img_0000002.jpg
img_0000003.jpg
...

The images in the tar file are named and ordered exactly the same way.
I searched a lot and found that the tarfile module is the best option, but when I try to read images from the tar file by name, it takes too long. The reason is that every time I call the getmember(name) method, it calls getmembers(), which scans the whole tar file, returns a list of all members, and then searches that list for the name.

If it helps, my dataset is a single 20 GB tar file.

I don’t know whether it is better to extract everything first and use the extracted folders in my CustomDataset, or to read directly from the archive.

Here is the code I am using to read a single file from tar file:

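        import io
        import tarfile
        from PIL import Image

        with tarfile.open('data.tar') as tf:
            tarinfo = tf.getmember('img_0000001.jpg')  # look up the member by name
            image = tf.extractfile(tarinfo)            # file-like object for that member
            image = image.read()
            image = Image.open(io.BytesIO(image))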

I use this code in the __getitem__ method of my CustomDataset class, which is indexed over all the names in filelist.txt.

Thanks for any advice


#2

Based on your description, it seems better to extract all files beforehand and read them separately.
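A minimal sketch of the extract-first approach, assuming there is enough disk space for the unpacked images. The helper names extract_once and load_image_bytes are made up for illustration, and decoding the bytes (for example with PIL) is left out:

```python
import tarfile
from pathlib import Path


def extract_once(tar_path, out_dir):
    """Unpack every member of the archive into out_dir in one sequential pass."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Newer Python versions accept an extraction filter; use "data" when it exists.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(tar_path) as tf:
        tf.extractall(out_dir, **kwargs)
    return out_dir


def load_image_bytes(out_dir, name):
    """Read one extracted file; __getitem__ could decode these bytes afterwards."""
    return (Path(out_dir) / name).read_bytes()
```

After a one-time call such as extract_once('data.tar', 'data_extracted'), each __getitem__ call only has to open a single small file by name.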

Alternatively, would it be possible to call tf.getmembers() once in your __init__ method, or to use the filenames stored in your filelist.txt to get each file with tf.extractfile(self.filelist[index]) in __getitem__?
Based on the docs, extractfile should also accept a file name directly.
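
A rough sketch of the read-from-archive idea, assuming the archive is scanned once and the open TarFile is reused for every lookup. The class name TarImageDataset is hypothetical, it returns raw bytes rather than a decoded image, and it does not subclass torch.utils.data.Dataset, so it stays runnable without PyTorch:

```python
import tarfile


class TarImageDataset:
    """Map-style dataset sketch: index the archive once, then read members by name."""

    def __init__(self, tar_path, filelist_path):
        self.tar_path = tar_path
        with open(filelist_path) as f:
            self.filelist = [line.strip() for line in f if line.strip()]
        # Opened lazily so that each worker process can create its own handle.
        self._tf = None
        self._members = None

    def _ensure_open(self):
        if self._tf is None:
            self._tf = tarfile.open(self.tar_path)
            # One full scan of the archive; later lookups are dictionary hits.
            self._members = {m.name: m for m in self._tf.getmembers()}

    def __len__(self):
        return len(self.filelist)

    def __getitem__(self, index):
        self._ensure_open()
        member = self._members[self.filelist[index]]
        data = self._tf.extractfile(member).read()
        # Decoding is left to the caller, e.g. PIL's Image.open(io.BytesIO(data)).
        return data
```

The archive is opened on first access instead of in __init__ because an open file handle is probably not safe to share across DataLoader workers. The cost is that each worker pays for one scan of the archive.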