DataLoader for text data: padded batches without repeated preprocessing overhead

I am looking for documentation on a few functionalities I would like to optimize in my batch generation:

  1. I have rather expensive preprocessing operations to compute on my input data, and I would like to parallelize them across workers.
  2. I would like to compute these operations only once, at first data load, rather than every time `__getitem__` is called on my Dataset instance (see the first sketch below).
  3. I would like to generate batches in which every element is padded to the length of the longest element in the batch.
  4. I would like to randomly shuffle the order of my batches at train time, but not the instances within each batch, to avoid excessive padding given the high variance of my input lengths (see the second sketch at the end of the post).
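
To make points 1 and 2 concrete, here is a rough sketch of the direction I'm considering: rather than relying on DataLoader's num_workers (which would redo the transform every epoch), the expensive step runs exactly once in `__init__`, parallelized across processes with a `multiprocessing.Pool`. The `preprocess` function here is just a stand-in for my real pipeline:

```python
import torch
from torch.utils.data import Dataset
from multiprocessing import Pool

def preprocess(text):
    # Stand-in for the expensive transform: whitespace-tokenize and
    # encode each token as its length. Replace with the real pipeline.
    return torch.tensor([len(tok) for tok in text.split()], dtype=torch.long)

class PrecomputedTextDataset(Dataset):
    def __init__(self, raw_texts, num_workers=4):
        # Run preprocessing exactly once, in parallel across processes,
        # so every later __getitem__ call is a cheap indexed lookup.
        with Pool(num_workers) as pool:
            self.examples = pool.map(preprocess, raw_texts)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]
```

(On platforms that spawn rather than fork, the Pool would need to be created under an `if __name__ == "__main__":` guard.)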

All of the tutorials I’ve seen (such as https://pytorch.org/tutorials/beginner/data_loading_tutorial.html) recommend applying transforms in the `__getitem__` method of the Dataset class and letting DataLoader produce batches by shuffling all instances. Could anyone point me toward tutorials or docs that cover the points above, or advise whether it just makes sense to write my own generator?
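
For points 3 and 4, here is roughly what I'm imagining, in case it clarifies the question: sort indices by length once, chunk them into fixed batches, and shuffle only the batch order each epoch via a custom batch sampler, with a `collate_fn` that pads each batch to its own maximum length. `ShuffledBatchSampler` and `pad_collate` are my own names, not PyTorch built-ins, and I'm assuming each dataset item is a 1-D LongTensor with 0 as the padding index:

```python
import random
import torch
from torch.utils.data import DataLoader, Sampler
from torch.nn.utils.rnn import pad_sequence

def pad_collate(batch):
    # Pad every sequence in the batch to the length of its longest member.
    lengths = torch.tensor([len(seq) for seq in batch])
    padded = pad_sequence(batch, batch_first=True, padding_value=0)
    return padded, lengths

class ShuffledBatchSampler(Sampler):
    """Yield length-sorted batches of indices in a shuffled order,
    keeping the contents of each batch fixed."""
    def __init__(self, lengths, batch_size):
        order = sorted(range(len(lengths)), key=lambda i: lengths[i])
        self.batches = [order[i:i + batch_size]
                        for i in range(0, len(order), batch_size)]

    def __iter__(self):
        random.shuffle(self.batches)  # reorder batches, not their contents
        return iter(self.batches)

    def __len__(self):
        return len(self.batches)

# dataset as in the first sketch; batch_sampler yields index lists directly
lengths = [len(ex) for ex in dataset.examples]
loader = DataLoader(dataset,
                    batch_sampler=ShuffledBatchSampler(lengths, batch_size=32),
                    collate_fn=pad_collate)
```

Sorting once keeps similar lengths together so padding waste stays low, while shuffling only the batch order still varies the training sequence between epochs. Still, I'd much rather lean on existing DataLoader machinery if these pieces are documented somewhere.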