Let’s say I have a large in-memory array of byte buffers.
If I use a datapipe like:
from torchdata.datapipes.iter import IterableWrapper
dp = IterableWrapper(very_big_array)
dp = dp.sharding_filter()
I believe this will deep-copy very_big_array, since the datapipe must be pickled and sent to each worker process.
Is there any way to get around this?
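For context, here is one workaround I've been considering: keep the data out of the datapipe entirely and have each worker re-open it from a shared backing store. This is just a sketch using a NumPy memory-mapped file (the file path, dtype, and shape are hypothetical); only the small (path, dtype, shape) triple would need to be pickled to workers, not the buffers themselves.

```python
import os
import tempfile

import numpy as np

# One-time setup: write the big array to a backing file.
path = os.path.join(tempfile.mkdtemp(), "buffers.dat")
big = np.arange(1_000_000, dtype=np.uint8)
big.tofile(path)

# Inside each worker, re-open the same file as a memory map instead
# of receiving a pickled copy of the array. The OS page cache is
# shared across processes, so no per-worker duplication occurs.
view = np.memmap(path, dtype=np.uint8, mode="r", shape=big.shape)

# The mapped view sees the same bytes as the original array.
assert bytes(view[:8]) == bytes(big[:8])
```

Would something like this play nicely with sharding_filter(), or is there a more idiomatic way within torchdata itself?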