General question:
- If you want to do some pre-processing of text, would you do it up front in the Dataset's __init__ method, when the data is loaded, or inside the __getitem__ method?
Specific questions:
- Say you want to filter out some samples, e.g. anything shorter than a length of x. Would you do that in the __getitem__() method, given that it always has to return a sample?
- Say you are doing text pre-processing such as sentence tokenization. If you do it in the __getitem__() method, is it redone every time the sample is drawn again, which seems inefficient? Or is the result cached?
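
To make the question concrete, here is a minimal sketch of the two options as I understand them. The class names, `toy_sentence_split`, and `min_len` are made up for illustration, and the classes are plain Python objects that only follow the map-style `__len__`/`__getitem__` protocol rather than subclassing `torch.utils.data.Dataset`, so the snippet runs without PyTorch.

```python
def toy_sentence_split(text):
    # Hypothetical stand-in for a real sentence tokenizer: split on periods.
    return [s.strip() for s in text.split(".") if s.strip()]


class EagerTextDataset:
    """Option A: filter and tokenize once, in __init__."""

    def __init__(self, texts, min_len):
        # Filtering here keeps __len__ in sync with what __getitem__ can return.
        kept = [t for t in texts if len(t) >= min_len]
        self.samples = [toy_sentence_split(t) for t in kept]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # Only a lookup; no pre-processing happens per access.
        return self.samples[idx]


class LazyTextDataset:
    """Option B: filter in __init__, but tokenize inside __getitem__."""

    def __init__(self, texts, min_len):
        self.texts = [t for t in texts if len(t) >= min_len]
        self.tokenize_calls = 0  # counts how often the tokenizer runs

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Runs on every access; nothing here stores the result.
        self.tokenize_calls += 1
        return toy_sentence_split(self.texts[idx])


if __name__ == "__main__":
    texts = ["Hi.", "First sentence. Second sentence.", "One more. And another."]
    eager = EagerTextDataset(texts, min_len=5)
    lazy = LazyTextDataset(texts, min_len=5)
    lazy[0]
    lazy[0]
    print(len(eager), eager[0], lazy.tokenize_calls)
```

In the lazy version the counter goes up on every access, so nothing in this sketch caches the tokenized result. In both versions the filtering sits in `__init__`, because I don't see how `__getitem__` could skip an index and still return a sample.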