Hi,
I have a dataset in WebDataset format, made up of tar files, and I created a pipeline to read it. The code I use is below; decode is a function that does some preprocessing on the data (images and their captions in my case).
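from braceexpand import braceexpand
from torch.utils.data.datapipes.iter.sharding import SHARDING_PRIORITIES
from torchdata.dataloader2 import DataLoader2
from torchdata.datapipes.iter import FileOpener

dp = FileOpener(list(braceexpand(data_path + "/{00000..05000}.tar")), mode="b")
dp = dp.load_from_tar(length=datasetLength).webdataset()
dp = dp.shuffle().sharding_filter()
dp.apply_sharding(num_processes, process_index, sharding_group=SHARDING_PRIORITIES.DISTRIBUTED)
dp = dp.map(decode)
dp = dp.batch(batch_size=batch_size, drop_last=True)
trainLoader = DataLoader2(dp)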
This works fine, but when I tried to use MultiProcessingReadingService to speed up data loading, I ran into a pickle error:
Process ForkProcess-1:
Traceback (most recent call last):
  File "/azureml-envs/azureml_99407ef20b35f1d5e9103d8f1bfac59a/lib/python3.8/site-packages/torch/utils/data/graph.py", line 67, in _list_connected_datapipes
    p.dump(scan_obj)
TypeError: cannot pickle 'ExFileObject' object
I have dill installed but it doesn’t change anything.
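In case it helps narrow this down, the same kind of TypeError seems reproducible with the standard library alone. The sketch below is only my guess at what happens inside the pipeline, not the actual torchdata code path: it builds a tiny in-memory archive (the member name 00000.txt and its contents are made up for illustration) and tries to pickle the stream returned by tarfile's extractfile, which I assume is the kind of object load_from_tar passes downstream.

```python
import io
import pickle
import tarfile

# Build a small tar archive in memory (hypothetical member name and payload).
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as archive:
    payload = b"a caption"
    info = tarfile.TarInfo(name="00000.txt")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))

# Reopen the archive and try to pickle the stream of one member.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as archive:
    member_stream = archive.extractfile("00000.txt")
    stream_type = type(member_stream).__name__
    try:
        pickle.dumps(member_stream)
        pickle_error = None
    except TypeError as exc:
        # Open tar member streams refuse to be pickled.
        pickle_error = exc

print(stream_type)
print(pickle_error)
```

If this is indeed the cause, the question may come down to where in the graph an open tar stream ends up being pickled when the worker processes are set up.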
Does anyone know what I am doing wrong?
Thanks in advance,
Corentin