Where to do text pre-processing: inside __getitem__ or before loading the dataset?

General question:

  1. If you want to pre-process text, would you do it before loading the dataset, in the Dataset's __init__ method, or inside the __getitem__ method?

Specific questions:

  1. Say you filter out some samples, e.g. anything shorter than some length x. Would you do that in the __getitem__ method, given that it needs to return a sample?
  2. Say you do text pre-processing such as sentence tokenization. If you do it in the __getitem__ method, is it redone every time a sample is drawn again, which is probably inefficient? Or is the result cached?

To be honest, I do the preprocessing of the training data as a completely separate step. Seems cleaner to me.
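
Here is a rough sketch of what I mean, not a definitive implementation. The names sentence_tokenize, preprocess and PreprocessedTextDataset are made up for illustration, the regex splitter is a naive stand-in for a real sentence tokenizer, and the class is written as plain Python so it runs without PyTorch installed (in real code it would normally subclass torch.utils.data.Dataset).

```python
import re

def sentence_tokenize(text):
    # Naive splitter for illustration only; a real pipeline would
    # likely use a proper tokenizer (e.g. from NLTK or spaCy).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def preprocess(raw_texts, min_chars):
    # Separate one-off step: drop short samples, then tokenize
    # each remaining sample exactly once.
    return [sentence_tokenize(t) for t in raw_texts if len(t) >= min_chars]

class PreprocessedTextDataset:
    # Hypothetical map-style dataset; with PyTorch this would
    # subclass torch.utils.data.Dataset.
    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # Plain lookup: no filtering or tokenization at sampling time.
        return self.samples[idx]

raw = [
    "Too short.",
    "First sentence here. Second sentence follows!",
    "Another long sample. It has two parts.",
]
dataset = PreprocessedTextDataset(preprocess(raw, min_chars=20))
print(len(dataset), dataset[0])
```

Filtering up front keeps __len__ consistent with what __getitem__ can actually return, so no index ever lands on a sample that should have been dropped. Also, __getitem__ runs every time an index is sampled and nothing is cached by default, so any tokenization placed there is repeated each epoch unless you add your own cache.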