Where to put data transforms? Dataset or collate_fn?

Versus · January 24, 2021, 8:10pm

We have a dataset of texts. One item = 1 text.
And we have 3 transform-pipelines: one for X1, one for X2 and one for Y.

Question: what is the “correct” way to implement transforms?

1. Inside the Dataset, so that each item returns tensors. But then it is not reversible, in case I want to see the original text.
2. Outside the Dataset and before the DataLoader (my current implementation).
3. Inside the collate_fn of the DataLoader, so that each item is converted to X1, X2 and Y at the batch level. Somehow, collate_fn doesn’t seem to me like the right place for such operations.

Ideally, I need transforms to run only when training starts.
I would appreciate your ideas.

ptrblck · January 25, 2021, 4:27am

The usual workflow would be to apply the transformations in Dataset.__getitem__.
If you need to see the original text (without the transformations), you could return the original text along with the transformed samples from __getitem__.
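
A minimal sketch of that suggestion, assuming a map-style Dataset over a list of strings. The class name TextDataset and the three toy pipelines (to_length, to_word_count, to_label) are hypothetical stand-ins for the real X1, X2 and Y transforms, which the thread does not show:

```python
import torch
from torch.utils.data import Dataset, DataLoader


class TextDataset(Dataset):
    """Hypothetical dataset: one item = one text, transformed on access."""

    def __init__(self, texts, x1_transform, x2_transform, y_transform):
        self.texts = texts
        self.x1_transform = x1_transform
        self.x2_transform = x2_transform
        self.y_transform = y_transform

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        # Return the raw text next to the tensors so the item stays inspectable.
        return {
            "text": text,
            "x1": self.x1_transform(text),
            "x2": self.x2_transform(text),
            "y": self.y_transform(text),
        }


# Toy pipelines, assumed only for illustration.
def to_length(text):
    return torch.tensor(len(text), dtype=torch.float32)


def to_word_count(text):
    return torch.tensor(len(text.split()), dtype=torch.float32)


def to_label(text):
    return torch.tensor(int(text.endswith("!")), dtype=torch.long)


dataset = TextDataset(
    ["hello world", "great news!", "ok"],
    to_length,
    to_word_count,
    to_label,
)
loader = DataLoader(dataset, batch_size=2, shuffle=False)

# The default collate function stacks the tensors and keeps the strings as a sequence.
first_batch = next(iter(loader))
```

Because the transforms run inside __getitem__, nothing is computed until the DataLoader (or any other caller) actually indexes the dataset, which seems to match the wish to have them run only once training starts.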