I have a huge dataset of 2 million JPG images in a single uncompressed TAR file. I also have a txt file where each line is the name of an image in the TAR file, in order:
img_0000001.jpg img_0000002.jpg img_0000003.jpg ...
and the images in the TAR file are in exactly the same order.
I searched a lot and found that the tarfile module is the best option, but when I try to read an image from the TAR file by name it takes too long. The reason is that every time I call the getmember(name) method, it calls getmembers(), which scans the whole TAR file, builds a list of all TarInfo objects, and then searches that list for the name.
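One workaround I am considering is to call getmembers() once, build my own name → TarInfo dict, and keep the file handle open, so every later lookup is a dict hit instead of a fresh scan. A minimal sketch (using a tiny in-memory TAR as a stand-in for my data.tar, so it runs on its own):

```python
import io
import tarfile

# Tiny in-memory TAR as a stand-in for data.tar, so the sketch is runnable.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w') as tw:
    for name, payload in [('img_0000001.jpg', b'aaa'), ('img_0000002.jpg', b'bbb')]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tw.addfile(info, io.BytesIO(payload))
buf.seek(0)

# Open once, scan once: getmembers() reads through the archive a single time;
# the dict then gives O(1) name -> TarInfo lookups instead of a linear search.
tf = tarfile.open(fileobj=buf)
index = {m.name: m for m in tf.getmembers()}

def read_bytes(name):
    # extractfile() seeks straight to the member's offset stored in its TarInfo
    return tf.extractfile(index[name]).read()
```

Would this be the right approach, or does tarfile already cache the member list internally once it has been scanned?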
If it helps, my dataset is a single 20 GB TAR file.
I don't know whether it is better to first extract everything and use the extracted folders in my CustomDataset, or to read directly from the archive.
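For the read-directly-from-archive option, the shape I have in mind is roughly this (TarImageDataset is just a placeholder name; the real CustomDataset would subclass torch.utils.data.Dataset and decode the bytes with PIL, but I left those out so the sketch runs on its own):

```python
import io
import tarfile

class TarImageDataset:
    """Rough sketch of reading images directly from the archive:
    open the TAR once and index it once, instead of per item."""

    def __init__(self, tar_file, names):
        # tar_file: a path or a file object; names: list from the txt file.
        if hasattr(tar_file, 'read'):
            self.tf = tarfile.open(fileobj=tar_file)
        else:
            self.tf = tarfile.open(tar_file)
        # One full scan up front; afterwards every lookup is a dict hit.
        self.index = {m.name: m for m in self.tf.getmembers()}
        self.names = names

    def __len__(self):
        return len(self.names)

    def __getitem__(self, i):
        data = self.tf.extractfile(self.index[self.names[i]]).read()
        return data  # real code would do: Image.open(io.BytesIO(data))

# Tiny in-memory archive to exercise the sketch (stand-in for data.tar).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w') as tw:
    for name, payload in [('img_0000001.jpg', b'jpeg-1'), ('img_0000002.jpg', b'jpeg-2')]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tw.addfile(info, io.BytesIO(payload))
buf.seek(0)

ds = TarImageDataset(buf, ['img_0000001.jpg', 'img_0000002.jpg'])
```

One thing I am not sure about with this variant: is a single shared tar handle safe across multiple DataLoader workers, or would each worker need to reopen the archive?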
Here is the code I am using to read a single file from the TAR file:

```python
import io
import tarfile
from PIL import Image

with tarfile.open('data.tar') as tf:
    tarinfo = tf.getmember('img_0000001.jpg')
    image = tf.extractfile(tarinfo)
    image = image.read()
    image = Image.open(io.BytesIO(image))
```
I use this code in the __getitem__ method of my CustomDataset class, which loops over all the names in the txt file.
Thanks for any advice