What's the point of datapipe.fork if it doesn't copy the underlying data?

I’m trying to fork the WikiText2 datapipe so that I can apply a different set of transforms to each of two instances of the data. But if I use datapipe.fork, as described on the Forker page of the TorchData documentation, both forks yield the very same objects (confirmed with data_batch is data2_batch). Am I just missing the use case for .fork? And is there a better way to split datapipes?

I don’t 100% understand your question. Can you elaborate or provide a code snippet of what isn’t working?

In your case, you can iterate through data_batch and data2_batch separately and perform different sets of transforms. Then, you can later combine the outputs together again if you’d like.

Yeah sure. So if I do something like:

dp1, dp2 = source_dp.fork(num_instances=2)
dp1 = dp1.map(dp1_transform)
dp2 = dp1.zip(dp2)

Then I get a series of tuples with the same underlying data even though dp1 had a different transform applied than dp2, because the underlying data returned by fork is the same. In my case, I’m trying to split the WikiText2 data into two identical copies and then apply random masking to one set and rejoin the two before passing into a dataloader. With the fork operation, I can’t seem to apply the masking operation to one fork without affecting the other.

What kind of objects does your datapipe yield? With immutable items such as integers, this works as expected:

from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper(list(range(10)))
dp1, dp2 = dp.fork(num_instances=2)
dp1 = dp1.map(lambda x: x + 100)
dp3 = dp1.zip(dp2)

print(list(dp3))  # [(100, 0), (101, 1), (102, 2), (103, 3), (104, 4), (105, 5), (106, 6), (107, 7), (108, 8), (109, 9)]

I assume you have a mutable object? If so, you can make a copy of it within dp1_transform to avoid affecting the copy in dp2.
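
To make the aliasing concrete, here is a minimal sketch that uses only the standard library. As an assumption for illustration, itertools.tee stands in for fork, since it also hands each branch a reference to the same object rather than a copy. mask_in_place and mask_copy are hypothetical transforms, not part of TorchData.

```python
import copy
from itertools import tee


def mask_in_place(tokens):
    # Mutates the list it receives, so every holder of that list sees the change.
    tokens[0] = "<mask>"
    return tokens


def mask_copy(tokens):
    # Copies first, so the original list (seen by the other branch) stays untouched.
    tokens = copy.deepcopy(tokens)
    tokens[0] = "<mask>"
    return tokens


def run(transform):
    source = [["the", "cat"], ["a", "dog"]]
    # Stand-in (assumption) for: dp1, dp2 = source_dp.fork(num_instances=2)
    branch1, branch2 = tee(source, 2)
    # Stand-in for: dp1.map(transform).zip(dp2)
    return list(zip(map(transform, branch1), branch2))


aliased = run(mask_in_place)
independent = run(mask_copy)

print(aliased[0])      # both elements are the same masked list
print(independent[0])  # masked copy paired with the untouched original
```

With real datapipes, the same idea would be to copy inside dp1_transform before masking, so that the branch passed through unchanged keeps the original tokens.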


I can see how this is confusing. Feel free to open an issue on GitHub suggesting an argument that deep copies the outputs coming out of fork. We can discuss further there and see whether there is enough interest for a PR/change. I think there are certainly some use cases for that. Thanks for bringing this up.
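
As a rough sketch of what such an option might do, the helper below deep copies every item each branch yields. fork_with_deepcopy is a hypothetical name, and itertools.tee is again only a standard-library stand-in for fork, so this illustrates the behavior and is not TorchData's implementation.

```python
import copy
from itertools import tee


def fork_with_deepcopy(iterable, num_instances=2):
    # Hypothetical helper: split one iterable into several branches and
    # lazily deep-copy each item, so no two branches share an object.
    branches = tee(iterable, num_instances)
    return tuple(map(copy.deepcopy, branch) for branch in branches)


source = [{"text": "hello"}, {"text": "world"}]
left, right = fork_with_deepcopy(source)
left, right = list(left), list(right)

# Mutating an item from one branch no longer affects the other branch.
left[0]["text"] = "<mask>"
print(right[0]["text"])
```

The trade-off is cost: deep copying every item in every branch can be expensive for large samples, which is one reason it may make sense as an opt-in argument and not the default.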

Thanks. I’ll open an issue when I get the chance.