Hi!
With TensorFlow's `tf.data.Dataset`, a dataset can be created from `([a, b, c], [d, e, f])`, in which case the tuples `(a, d)`, `(b, e)` and `(c, f)` will be issued when reading it.
The main interest is that when calling `.batch(X)`, each component of the tuple is batched separately, which makes it easy to vectorize preprocessing.
Now, one might think: what about doing this with two DataPipes? Well, if the in-memory structure you are reading examples and labels from only returns both together, the only efficient (high-performance) way of building the pipeline is to have a single one that does what TF does.
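To make the behaviour I am after concrete, here is a rough pure-Python sketch. It is only an illustration of the semantics, not TensorFlow's or TorchData's actual API: the names `read_pairs` and `batch_componentwise` are hypothetical helpers I made up for this post.

```python
from itertools import islice


def read_pairs(examples, labels):
    """Single pipeline: yield (example, label) tuples from one source."""
    # Hypothetical helper standing in for a reader that returns both at once.
    yield from zip(examples, labels)


def batch_componentwise(pairs, batch_size):
    """Group tuples into batches, batching each tuple component separately."""
    it = iter(pairs)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        # Transpose [(x1, y1), (x2, y2)] into ([x1, x2], [y1, y2]).
        xs, ys = zip(*chunk)
        yield list(xs), list(ys)


pairs = list(read_pairs(["a", "b", "c"], ["d", "e", "f"]))
batches = list(batch_componentwise(read_pairs(["a", "b", "c"], ["d", "e", "f"]), 2))

for example_batch, label_batch in batches:
    print(example_batch, label_batch)
```

Each batch comes out as one list of examples and one list of labels, so a preprocessing function could be applied to a whole list of examples at once.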
I am very new to TorchText, but it looked to me like building such a single tuple-yielding pipeline is not possible, so I am asking here.
Thanks in advance for the answers!