DataLoader gets stuck

Hi, developers:

I have a large training dataset packed in a zip file.
In train.py, I open it once and then pass it to the DataLoader; here is the code:

import torch
import zipfile

# open the zip dataset once
zf = zipfile.ZipFile(zip_path)
# read the images from the zip via the DataLoader
train_loader = torch.utils.data.DataLoader(
    DataSet(zf, transform),
    batch_size=args.batch_size,
    shuffle=True,
    num_workers=args.workers,
    pin_memory=True)

If num_workers is set to a value larger than 1, the DataLoader gets stuck while reading images from the zip file and never finishes. How can I fix this? Thanks.


If I understand correctly, this happens during regular use, not e.g. with child processes hanging after quitting the run (I was wondering whether it might be related to the discussion here: "PyTorch doesn't free GPU's memory of it gets aborted due to out-of-memory error" and a bug in Python's multiprocessing).

What does your data loading / training loop look like?
Maybe you could run something like

for epoch in range(num_epochs):
    for batch_idx, (features, targets) in enumerate(train_loader):
        print(batch_idx)

I would be curious what values of batch_idx you get if num_workers > 1, i.e. whether you ever hit the condition batch_idx > num_training_examples / batch_size.


Reading the same file from multiple processes can be troublesome. You should check whether zipfile is multiprocessing-safe.

This is the reason: zipfile does not support being read from multiple processes.
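
For what it's worth, one alternative workaround (a minimal sketch, assuming a recent PyTorch that provides torch.utils.data.get_worker_info, and a dataset that exposes fname and zip_handle attributes as in the snippet below) is to open one handle per worker via the DataLoader's worker_init_fn:

import zipfile
from torch.utils.data import DataLoader, get_worker_info

def open_zip_per_worker(worker_id):
    # runs once inside each freshly started worker process, so every
    # worker ends up with its own independent ZipFile handle
    dataset = get_worker_info().dataset
    dataset.zip_handle = zipfile.ZipFile(dataset.fname, 'r')

# loader = DataLoader(my_dataset, batch_size=32, num_workers=4,
#                     worker_init_fn=open_zip_per_worker)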


I had the same issue when opening a tarfile. A quick fix is to open the zip handle lazily at the start of __getitem__, so that each worker process ends up with its own handle.

import zipfile
from torch.utils.data import Dataset

class MyDataSet(Dataset):
    def __init__(self, filename):
        self.zip_handle = None
        self.fname = filename

    def __getitem__(self, x):
        # opened on first access, i.e. inside the worker process
        if self.zip_handle is None:
            self.zip_handle = zipfile.ZipFile(self.fname, 'r')
        # do stuff
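
A usage sketch under the same assumptions (the zip path here is hypothetical, and __len__ plus the actual item loading still need to be filled in): because the handle is only opened on first access, every worker process ends up with its own ZipFile:

import torch

dataset = MyDataSet('train_images.zip')  # hypothetical path
loader = torch.utils.data.DataLoader(dataset, batch_size=32,
                                     shuffle=True, num_workers=4)
for batch in loader:  # each worker lazily opens its own handle
    ...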

Hi, how did you solve this problem? I found that when I use multiple threads to read the zip file, it seems fine. But when I create a dataset class in your style and pass it to the DataLoader, it raises:

File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/zipfile.py", line 925, in _read1
    data = self._decompressor.decompress(data, n)
zlib.error: Error -3 while decompressing data: invalid distance too far back

I think this is independent of my original problem and highly related to the issue here. Try Python 3.5.

Do you mean there is something wrong with the zip library?
I used a global zipfile object to work around this problem, because I found that the DataLoader uses multiprocessing.Process. I am not sure whether this approach is correct.
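
For illustration, a minimal sketch of such a global-handle pattern (a reconstruction, not the actual code; it assumes the global starts out as None and is only opened on first access):

import zipfile

_zf = None  # module-level handle; starts as None in every process

def get_zip(path):
    # the first call inside a given process opens a fresh handle for it
    global _zf
    if _zf is None:
        _zf = zipfile.ZipFile(path, 'r')
    return _zf

Note that this only behaves like the lazy per-worker open above if the handle has not already been opened in the main process before the workers fork; otherwise the forked workers inherit and share the same handle.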

Hi,

I've noticed that sometimes closing the opened resource helps. Building on @victorhcm's example, I use the following kind of setup:

import zipfile
from torch.utils.data import Dataset

class MyDataSet(Dataset):
    def __init__(self, filename):
        self.fname = filename

    def __getitem__(self, x):
        with zipfile.ZipFile(self.fname, 'r') as zip_handle:
            ...  # do stuff while the file is open
        # file is now closed

To add, my go-to files are either CSVs or CSV-derived local SQLite databases. The latter in particular are very sensitive to connections being left open, i.e.

    ...
    def __getitem__(self, x):
        with self.db:
            ...  # do stuff; the transaction is committed (or rolled back) on exit
    ...
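
To make that concrete, here is a self-contained sketch of a SQLite-backed dataset combining the lazy per-worker open with such a with-block (the samples table and its columns are hypothetical, for illustration only):

import sqlite3
from torch.utils.data import Dataset

class SqliteDataset(Dataset):
    def __init__(self, db_path):
        self.db_path = db_path
        self.db = None
        # short-lived connection just to count the rows, closed right away
        conn = sqlite3.connect(db_path)
        self.length = conn.execute('SELECT COUNT(*) FROM samples').fetchone()[0]
        conn.close()

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.db is None:  # opened lazily, once per worker process
            self.db = sqlite3.connect(self.db_path)
        with self.db:  # commits or rolls back the transaction on exit
            row = self.db.execute(
                'SELECT features, target FROM samples WHERE rowid = ?',
                (idx + 1,)).fetchone()
        return row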

Petteri


I think that if you create the zip handle directly in the dataset's __init__, instead of passing it in from the outside, it will also work.