Best order to operate on tensors in DataSet and DataLoader

I haven’t found any list of recommendations in the docs, or any discussion, about the best order in which to operate on tensors before feeding them to the network.

I ask because I got my DataSet and DataLoader (with a padding function call) working well yesterday. However, I was shocked at how long it takes to build up a batch.

DataSet
Opening each HDF5 file, extracting the multiple input arrays and one output numpy array, and converting them to tensors takes about 0.3 sec per sample (files average 25 MB). I only have 48 GB of RAM, so I can’t run too many workers if the batch size is decent. However, the workers are barely taxing the SSD.
Operations include:

  • loading each of the 7 ndarrays, transposing them, and converting them to tensors of the correct dtype (not calling contiguous() or pinning memory here)
  • storing each tensor in a dict, which is returned (a rough sketch of this __getitem__ follows below)
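
A simplified sketch of that __getitem__, to make the discussion concrete; the class name, key names, and dtypes here are placeholders, not my actual code:

```python
import h5py
import torch
from torch.utils.data import Dataset

class H5SampleDataset(Dataset):
    """One HDF5 file per sample; returns a dict of tensors."""

    def __init__(self, file_paths, array_keys):
        self.file_paths = file_paths
        # e.g. ["input_num", "input_cat", "output", "weights", ...]
        self.array_keys = array_keys

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        sample = {}
        with h5py.File(self.file_paths[idx], "r") as f:
            for key in self.array_keys:
                arr = f[key][()]          # read the full ndarray from disk
                arr = arr.transpose()     # put dims in the order the model expects
                # from_numpy shares memory with the ndarray; the cast only
                # copies if the dtype actually differs
                sample[key] = torch.from_numpy(arr).float()
        return sample
```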

DataLoader
The collate_fn operations (which take about 0.05-0.2 sec per sample):

  • taking the list of dicts and finding the maximum size along the padded dimension across those samples
  • iterating over the batch to build up a list for each tensor “type” (input_num, output, weights, etc.), padding each tensor before it is added to its list (see the sketch after this list)
  • stacking each list
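
A minimal sketch of that collate_fn, assuming the samples differ only along the first dimension of each tensor (keys and the padded dimension are simplified):

```python
import torch
import torch.nn.functional as F

def pad_collate(batch):
    """batch is a list of dicts as returned by the Dataset above."""
    out = {}
    for key in batch[0].keys():
        tensors = [sample[key] for sample in batch]
        # maximum size along the padded (first) dimension in this batch
        max_len = max(t.shape[0] for t in tensors)
        # F.pad specifies padding from the last dimension backwards, so the
        # final pair of the tuple applies to dim 0
        padded = [
            F.pad(t, (0, 0) * (t.dim() - 1) + (0, max_len - t.shape[0]))
            for t in tensors
        ]
        out[key] = torch.stack(padded)
    return out
```

(For padding along the first dimension like this, torch.nn.utils.rnn.pad_sequence(tensors, batch_first=True) could replace the explicit max/pad/stack.)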

Finally, I create the DataLoader with pin_memory=True, and I copy the tensors to the GPU inside the training loop’s batch iterator.
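
Roughly like this (num_workers and batch_size are just illustrative values):

```python
import torch
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,                 # the H5SampleDataset sketched above
    batch_size=100,
    num_workers=4,           # limited by the 48 GB RAM budget
    collate_fn=pad_collate,
    pin_memory=True,
)

device = torch.device("cuda")
for batch in loader:
    # with pinned host memory, non_blocking=True lets the host-to-device
    # copy overlap with compute on the GPU
    batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
    # ... forward / backward / step ...
```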

None of these times seems really large, but with a batch size of 100 samples it works out to almost 40 seconds per batch. Does anything jump out as room for improvement? Other than the transpose, padding, and stacking, I’m doing no other operations on the data itself.