MultiProcessingReadingService problem with tar files

corentin_r · August 24, 2023, 7:10pm

Hi,

I have a webdataset stored as a set of tar files, and I built a DataPipe pipeline to read it. The code is below; decode is a function that preprocesses the data (images and their captions in my case).

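from braceexpand import braceexpand
from torch.utils.data.datapipes.iter.sharding import SHARDING_PRIORITIES
from torchdata.dataloader2 import DataLoader2
from torchdata.datapipes.iter import FileOpener

# data_path, datasetLength, num_processes, process_index, batch_size and decode are defined elsewhere.
dp = FileOpener(list(braceexpand(data_path + "/{00000..05000}.tar")), mode="b")
dp = dp.load_from_tar(length=datasetLength).webdataset()
dp = dp.shuffle().sharding_filter()
dp.apply_sharding(num_processes, process_index, sharding_group=SHARDING_PRIORITIES.DISTRIBUTED)
dp = dp.map(decode)
dp = dp.batch(batch_size=batch_size, drop_last=True)

trainLoader = DataLoader2(dp)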

This works fine, but when I tried to use the MultiProcessingReadingService to make data loading faster, I ran into a pickle error:

Process ForkProcess-1:
Traceback (most recent call last):
  File "/azureml-envs/azureml_99407ef20b35f1d5e9103d8f1bfac59a/lib/python3.8/site-packages/torch/utils/data/graph.py", line 67, in _list_connected_datapipes
    p.dump(scan_obj)
TypeError: cannot pickle 'ExFileObject' object

I have dill installed, but it doesn’t change anything.
Does anyone know what I am doing wrong?

Thanks in advance,
Corentin
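
The TypeError in the traceback can be reproduced with the standard library alone, which may help narrow down where it comes from. The sketch below is not taken from the thread and is not a fix for the torchdata pipeline: it only shows that the file-like object returned by tarfile (an ExFileObject) cannot be pickled, while the bytes read from it can. The archive contents and the helper names make_tar_bytes and try_pickle are made up for illustration.

```python
import io
import pickle
import tarfile


def make_tar_bytes() -> bytes:
    """Build a tiny in-memory tar archive with one member (illustrative data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        payload = b"a caption"
        info = tarfile.TarInfo(name="000000000.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def try_pickle(obj) -> str:
    """Return 'ok' if obj can be pickled, otherwise the error message."""
    try:
        pickle.dumps(obj)
        return "ok"
    except TypeError as exc:
        return str(exc)


with tarfile.open(fileobj=io.BytesIO(make_tar_bytes()), mode="r") as tar:
    member = tar.getmembers()[0]
    stream = tar.extractfile(member)  # open handle into the archive
    stream_result = try_pickle(stream)  # fails: the handle is not picklable
    data = stream.read()  # plain bytes
    bytes_result = try_pickle((member.name, data))  # succeeds

print(stream_result)
print(bytes_result)
```

If the same thing happens inside the pipeline, one plausible reading is that a DataPipe is holding open tar streams at the moment the graph is pickled for the worker processes. That is an assumption, not something confirmed in this thread.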

cinjon · October 19, 2023, 6:11pm

Hitting ~ the same problem. Did you solve this?

corentin_r · October 21, 2023, 12:16pm

Hey,

I ended up decompressing my webdataset into individual files and creating a regular PyTorch dataset.
I used the regular DataLoader with multiple workers to speed it up.
Since my dataset was on the same server as my compute cluster, I didn’t have too much trouble with latency.

Here is what I used:

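import time
from io import BytesIO

from PIL import Image  # assumed source of Image


# Method of the dataset class. captionKey (indexed with .iloc), containerClient
# and preprocess are attributes set up elsewhere in the class.
def __getitem__(self, index):
    line = self.captionKey.iloc[index]
    caption, key = line["caption"], line["key"]

    # Zero-pad the key to 9 characters.
    key = str(key).zfill(9)

    numberAttempts = 0
    while True:
        # Use a fresh buffer per attempt so a failed download leaves no partial data behind.
        stream = BytesIO()
        try:
            self.containerClient.download_blob(key[:5] + key).readinto(stream)
            break
        except Exception:
            # Wait a second and retry; give up after more than 10 failures.
            numberAttempts += 1
            time.sleep(1)
            if numberAttempts > 10:
                raise Exception(f"Impossible to download the image {key[:5] + key}")

    with Image.open(stream) as image:
        image = self.preprocess(image)

    return image, caption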
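
For anyone reusing this, the retry loop could also be pulled out into a small helper. This is only a sketch under a few assumptions: download_with_retry, fetch, max_attempts and delay_s are hypothetical names that do not appear in the original code, and the helper deliberately knows nothing about the storage client. It creates a new buffer on every attempt and chains the last error when it gives up.

```python
import time
from io import BytesIO
from typing import Callable, Optional


def download_with_retry(
    fetch: Callable[[BytesIO], None],
    max_attempts: int = 10,
    delay_s: float = 1.0,
) -> BytesIO:
    """Call fetch(stream) until it succeeds or max_attempts is exhausted."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        stream = BytesIO()  # fresh buffer: no partial bytes from a failed try
        try:
            fetch(stream)
            stream.seek(0)  # rewind so the caller can read from the start
            return stream
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay_s)
    raise RuntimeError(
        f"download failed after {max_attempts} attempts"
    ) from last_error
```

Inside __getitem__ it would presumably be called with a lambda wrapping the existing download call, for example download_with_retry(lambda s: self.containerClient.download_blob(key[:5] + key).readinto(s)), and the returned stream passed to Image.open as before.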