Dataloader with zipfile failed

wgting · April 17, 2019, 3:53am

I am trying to load data from a zip file with Python's zipfile library. However, it seems that it is not compatible with torch's DataLoader class.

import numpy as np
import cv2
import io
from torch.utils.data import DataLoader, Dataset
from torchvision.transforms import ToTensor
import zipfile

class ZipDataset(Dataset):
    def __init__(self, root_path, cache_into_memory=False):
        if cache_into_memory:
            f = open(root_path, 'rb')
            self.zip_content = f.read()
            f.close()
            self.zip_file = zipfile.ZipFile(io.BytesIO(self.zip_content), 'r')
        else:
            self.zip_file = zipfile.ZipFile(root_path, 'r')
        self.name_list = list(filter(lambda x: x[-4:] == '.jpg', self.zip_file.namelist()))
        self.to_tensor = ToTensor()

    def __getitem__(self, key):
        buf = self.zip_file.read(name=self.name_list[key])
        img = self.to_tensor(cv2.imdecode(np.frombuffer(buf, dtype=np.uint8), cv2.IMREAD_COLOR))
        return img

    def __len__(self):
        return len(self.name_list)

if __name__ == '__main__':
    dataset = ZipDataset('COCO.zip', cache_into_memory=False)
    dataloader = DataLoader(dataset, batch_size=2, num_workers=2)
    for batch_idx, sample in enumerate(dataloader):
        print(batch_idx, sample.size())

When num_workers=0 or num_workers=1, everything works well. But if num_workers is larger than 1, the program raises a strange error:

Traceback (most recent call last):
  File "test_zip_file.py", line 31, in <module>
    for batch_idx, sample in enumerate(dataloader):
  File "/home/admin/anaconda3/envs/pytorch1_0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
    return self._process_next_batch(batch)
  File "/home/admin/anaconda3/envs/pytorch1_0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
zipfile.BadZipFile: Traceback (most recent call last):
  File "/home/admin/anaconda3/envs/pytorch1_0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/admin/anaconda3/envs/pytorch1_0/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "test_zip_file.py", line 21, in __getitem__
    buf = self.zip_file.read(name=self.name_list[key])
  File "/home/admin/anaconda3/envs/pytorch1_0/lib/python3.6/zipfile.py", line 1337, in read
    with self.open(name, "r", pwd) as fp:
  File "/home/admin/anaconda3/envs/pytorch1_0/lib/python3.6/zipfile.py", line 1419, in open
    % (zinfo.orig_filename, fname))
zipfile.BadZipFile: File name in directory '000000000009.1.jpg' and header b'\x00(\xa2\x8a\x00(\xa2\x8a\x00(\xa2\x8a\x00(\xa2\x8a\x00(\xa2\x8a\x00(\xa2\x8a\x00(\xa2\x8a\x00(\... '  differ.

It looks like a zip file cannot be read from multiple processes this way. Interestingly, if we set cache_into_memory=True (which means the whole zip file is read into memory), the program works fine.

This code has been tested on Windows 10 and Ubuntu 16.04 with torch 0.4.1 and 1.0.0. All of them give the same result.

mathematics · June 10, 2020, 3:32pm

A similar error happens to me when using the DataLoader for CelebA, but not for MNIST.

The error is:

-----------------------------------------------------------------------
BadZipFile                                Traceback (most recent call last)
<ipython-input-21-51d4dd0d2430> in <module>()
      4                   transform=transforms.Compose([
      5       transforms.ToTensor(),
----> 6       transforms.Normalize((1, -1, 0.5), (0.5, -0.5, 1))
      7       ])
      8                   )

3 frames
/usr/lib/python3.6/zipfile.py in _RealGetContents(self)
   1196             raise BadZipFile("File is not a zip file")
   1197         if not endrec:
-> 1198             raise BadZipFile("File is not a zip file")
   1199         if self.debug > 1:
   1200             print(endrec)

BadZipFile: File is not a zip file

This happened while I was simply downloading the dataset through DataLoader, as follows:

dataset = torch.utils.data.DataLoader(
    datasets.CelebA(
        'data/',
        download=True,
        transform=transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((1, -1, 0.5), (0.5, -0.5, 1)),
        ]),
    )
)

I am using torch version ‘1.5.0+cu101’.

Also, ImageNet is apparently no longer publicly accessible. Why?

RuntimeError: The dataset is no longer publicly accessible. You need to download the archives externally and place them in the root directory.

I would appreciate any help.

ptrblck · June 11, 2020, 8:46am

You could try downloading the files again to a different folder to check whether the download failed.
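
One way to check for a failed download before retrying is to test the archive directly. Here is a minimal sketch using only the standard library. The helper name check_zip is made up for illustration and is not part of torchvision:

```python
import zipfile


def check_zip(path):
    """Return a short problem description, or None if the archive looks intact."""
    # is_zipfile() returns False for missing, truncated, or non-zip files.
    if not zipfile.is_zipfile(path):
        return "not a zip file"
    with zipfile.ZipFile(path, "r") as zf:
        # testzip() reads every member and returns the name of the first
        # one with a bad CRC or header, or None if all members are fine.
        bad = zf.testzip()
    if bad is not None:
        return f"corrupt member: {bad}"
    return None
```

Calling check_zip on the downloaded archive should return None for a healthy file. Note that testzip() reads the whole archive, so it can take a while on large datasets.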

The authors of the ImageNet dataset removed a publicly available download link, so that you would have to register again in order to download it (or use another mirror you can find).

mathematics · June 15, 2020, 3:52am

Oh, apologies for the slow reply. Downloading to a different folder does work.

haowxu · June 22, 2020, 8:54pm

I’d like to know whether anyone has a solution to the original question. I am facing a similar problem now: my data loader using zipfile fails when num_workers is set greater than 1, but the exception I get is a CRC-32 error. I guess this is related to several workers reading from a single ZipFile instance at the same time.

Thanks!

superbock · August 12, 2020, 12:13pm

I had a similar issue with zipfile giving random errors. It seems to be related to the handle being opened and stored during instantiation, which does not work with multiprocessing. The solution was to open the handle in __getitem__() on first use and cache it.
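
A minimal sketch of that idea, under a few assumptions: the names LazyZipDataset, _handle and suffix are made up for illustration, items are returned as raw bytes so that decoding stays with the caller, and the torch Dataset base class is left out to keep the sketch dependency-free (in practice the class would subclass torch.utils.data.Dataset).

```python
import os
import zipfile


class LazyZipDataset:
    """Map-style dataset that opens the archive lazily, once per process."""

    def __init__(self, zip_path, suffix=".jpg"):
        self.zip_path = zip_path
        # List the members with a short-lived handle and close it again,
        # so no open file object is carried into the worker processes.
        with zipfile.ZipFile(zip_path, "r") as zf:
            self.name_list = [n for n in zf.namelist() if n.endswith(suffix)]
        self._zip_file = None
        self._owner_pid = None

    def _handle(self):
        # Open a handle on first use, and reopen if the cached one was
        # created in a different process (for example inherited via fork).
        if self._zip_file is None or self._owner_pid != os.getpid():
            self._zip_file = zipfile.ZipFile(self.zip_path, "r")
            self._owner_pid = os.getpid()
        return self._zip_file

    def __getstate__(self):
        # Drop the handle when the dataset is pickled for a spawned worker.
        state = self.__dict__.copy()
        state["_zip_file"] = None
        state["_owner_pid"] = None
        return state

    def __getitem__(self, key):
        # Only the requested member is read from disk.
        return self._handle().read(self.name_list[key])

    def __len__(self):
        return len(self.name_list)
```

Only the file handle is cached here, not the archive contents, so memory use should stay far below the size of the zip file.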

HTH

Rohit_Choudhary · December 14, 2020, 8:30am

Yes, I had the same issue. Everything works fine when reading with a single worker, but using the DataLoader with multiple workers raises the exception zipfile.BadZipFile: Bad CRC-32 for file 'train2017/000000431472.jpg'

train2017 is the COCO dataset.

Rohit_Choudhary · December 14, 2020, 9:22am

Isn’t it the case that caching will need more RAM? I have 18GB of data, but Colab has only 12GB of RAM. Does someone have a solution for this case?

zipfile works fine with a single worker, but throws an error with the DataLoader and multiple workers.

Rohit_Choudhary · December 14, 2020, 9:24am

This is not working for me.

haowxu · January 21, 2021, 2:51am

So my guess is that zipfile simply does not support this kind of concurrent access. I need to find a workaround.

aaaa · January 30, 2022, 7:58am

I had the same problem. Have you solved it yet?

usryokousha · November 2, 2022, 8:20am

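You may be able to fix the num_workers issue by serializing reads from the shared ZipFile handle with a lock, as shown below. Note that a threading.Lock only coordinates threads inside one process, while DataLoader workers are separate processes, so this may not be enough on its own.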
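```python
from threading import Lock
from torch.utils.data import Dataset
import zipfile
import io
import PIL.Image as Image


class ImageDataset(Dataset):
    def __init__(self, zip_path, transform):
        # Keep the open archive on the instance so __getitem__ can reach it.
        self.zipobj = zipfile.ZipFile(zip_path, 'r')
        self.nameslist = self.zipobj.namelist()
        self.transform = transform
        self.lock = Lock()

    def __len__(self):
        return len(self.nameslist)

    def __getitem__(self, idx):
        # Serialize reads from the shared ZipFile handle.
        with self.lock:
            data = self.zipobj.read(self.nameslist[idx])
        img = Image.open(io.BytesIO(data))
        if self.transform is not None:
            img = self.transform(img)
        return img
```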