Missing child_counter from StreamWrapper?

Hello!

While using datapipes (torchdata 0.5.1) I ran into the following:

S3FileLoader feeds into a ZipArchiveLoader. There is an exception at line 71 of ZipArchiveLoader because child_counter is missing from the parent stream (the one that the S3 loader provides). If I add (before the try/except)

 data_stream.child_counter = 0

then everything runs as expected. Is this a bug?

Thanks!

  1. Can you try your code with TorchData 0.6.0?
  2. What does your DataPipe setup look like? Is there a minimal, reproducible code snippet that you can provide?

Hello!

Yes → I will update to 0.6.0 after this experiment completes.
Yes → I will put together the snippet tomorrow.

Thanks!

Hello, this is the code that reproduces the problem:

for url, stream in S3FileLoader(IterableWrapper([f's3://{self.__bucket}/{self.__dataset}/zip/{archive}'])):
    if archive.lower().endswith('.zip'):
        for name, doc in ZipArchiveLoader(IterableWrapper([('XXXX', stream)])):
            yield name, doc

I hope this helps

I don’t think your code snippet uses TorchData in the intended pattern. Can you try something like this?

dp = IterableWrapper([f's3://{self.__bucket}/{self.__dataset}/zip/{archive}'])
dp = dp.filter(_insert_filter_fn_here)  # Filter for ".zip"
dp = dp.load_files_by_s3()
dp = dp.load_from_zip()
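
For the filter placeholder above, a minimal sketch of what the function could look like. The name is_zip and the sample URLs are made up for illustration; the check itself mirrors the archive.lower().endswith('.zip') test from the original snippet.

```python
def is_zip(url: str) -> bool:
    """Return True when the URL ends in .zip, ignoring case."""
    return url.lower().endswith(".zip")


# Hypothetical sample URLs, only to show which entries pass the filter.
urls = [
    "s3://my-bucket/my-dataset/zip/a.zip",
    "s3://my-bucket/my-dataset/zip/B.ZIP",
    "s3://my-bucket/my-dataset/zip/readme.txt",
]
kept = [u for u in urls if is_zip(u)]

# Assumed use in the chain above (needs torchdata and S3 access, so not run here):
#   dp = dp.filter(is_zip)
```

Filtering on the URL before load_files_by_s3() means non-zip objects are never downloaded.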

This works. The attribute is in place now. Thanks!