Hello everyone
I have a huge dataset of 2 million JPG images in one uncompressed TAR file. I also have a txt file where each line is the name of an image in the TAR file, in order:
img_0000001.jpg
img_0000002.jpg
img_0000003.jpg
...
and the images in the tar file are in exactly the same order.
I searched a lot and found that the tarfile
module seems to be the best option, but when I tried to read images from the tar file by name,
it takes too long. The reason is that every time I call the getmember(name)
method, it calls the getmembers()
method, which scans the whole tar file and builds a list
of all members, and then searches for the name in that list.
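One workaround I'm considering (a minimal sketch, not tested on my real data) is to pay for the full scan only once: call getmembers() a single time, keep a dict from name to TarInfo, and then every later lookup is O(1). The archive here is a tiny in-memory one built just for demonstration; the image names are made up:

```python
import io
import tarfile

# Build a tiny in-memory tar archive just for demonstration.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w') as tw:
    for i in range(3):
        payload = b'fake jpeg bytes %d' % i
        info = tarfile.TarInfo(name='img_%07d.jpg' % (i + 1))
        info.size = len(payload)
        tw.addfile(info, io.BytesIO(payload))
buf.seek(0)

tf = tarfile.open(fileobj=buf)                   # keep the handle open
members = {m.name: m for m in tf.getmembers()}   # one full scan, cached

# Later lookups skip the scan entirely:
data = tf.extractfile(members['img_0000002.jpg']).read()
print(data)  # b'fake jpeg bytes 1'
```

The key point is that the expensive linear scan happens once, instead of on every image access.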
If it helps, my dataset is a single 20 GB tar file.
I don't know whether it is a better idea to extract everything first and use the extracted folders in my CustomDataset,
or to read directly from the archive.
Here is the code I am using to read a single file from the tar file:

import io
import tarfile
from PIL import Image

with tarfile.open('data.tar') as tf:
    tarinfo = tf.getmember('img_0000001.jpg')
    image = tf.extractfile(tarinfo)
    image = image.read()
    image = Image.open(io.BytesIO(image))
I use this code in the __getitem__
method of my CustomDataset
class, which loops over all the names in filelist.txt
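To make this concrete, here is a rough sketch of what I mean by the dataset class, with the tar opened and indexed once in __init__ instead of per item. The class name and the idea of returning raw bytes are my own simplifications; real code would subclass torch.utils.data.Dataset and decode with PIL, and the demo archive below is a fake in-memory one:

```python
import io
import tarfile

class TarImageDataset:
    """Sketch: index the tar once, then do O(1) member lookups.
    In practice, subclass torch.utils.data.Dataset and decode bytes
    with PIL; here we just return the raw bytes."""

    def __init__(self, tar_source, names):
        # Accept either a path or an open file-like object.
        if hasattr(tar_source, 'read'):
            self.tf = tarfile.open(fileobj=tar_source)
        else:
            self.tf = tarfile.open(tar_source)
        # One full scan of the archive; tarfile caches the result.
        self.members = {m.name: m for m in self.tf.getmembers()}
        self.names = names

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        # O(1) dict lookup instead of rescanning the archive per item.
        raw = self.tf.extractfile(self.members[self.names[idx]]).read()
        return raw  # real code: Image.open(io.BytesIO(raw))

# Demo with a tiny in-memory archive:
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w') as tw:
    for name, payload in [('img_0000001.jpg', b'a'), ('img_0000002.jpg', b'b')]:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tw.addfile(info, io.BytesIO(payload))
buf.seek(0)

ds = TarImageDataset(buf, ['img_0000001.jpg', 'img_0000002.jpg'])
print(ds[1])  # b'b'
```

One caveat I'm aware of: a single shared tarfile handle is not safe across multiple DataLoader workers, so each worker would need to open its own handle.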
Thanks for any advice