How much work should your dataset class do for you?

I’m trying to determine whether I should refactor the way my dataset’s __getitem__ method does its pre-processing.

Are there any heuristics I should consider beyond the obvious ones, time complexity and code readability?

For a given audio file that I load, I want to:

  1. Pad the length to match a given interval size.
  2. Cut the audio into equal, pre-determined interval lengths.
  3. Create several types of spectrograms from each piece of the resulting sequence.

It’s a lot. Is there any reason that I shouldn’t do all of these things right inside of __getitem__?
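
Here’s roughly what that all-in-one version would look like; the class name, the interval parameter, and the specific torchaudio transforms are just placeholders for illustration:

```python
import torch
import torch.nn.functional as F
import torchaudio
from torch.utils.data import Dataset

class AudioChunkDataset(Dataset):
    """Sketch: padding, chunking, and spectrograms all inside __getitem__."""

    def __init__(self, file_paths, interval_samples, sample_rate=16000):
        self.file_paths = file_paths              # list of audio file paths
        self.interval_samples = interval_samples  # chunk length in samples
        # Transforms are cheap to construct, so build them once here.
        self.mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate)
        self.linear = torchaudio.transforms.Spectrogram()

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        waveform, _sr = torchaudio.load(self.file_paths[idx])  # (channels, time)

        # 1. Pad the length up to a multiple of the interval size.
        remainder = waveform.size(-1) % self.interval_samples
        if remainder:
            waveform = F.pad(waveform, (0, self.interval_samples - remainder))

        # 2. Cut into equal, pre-determined interval lengths.
        chunks = torch.stack(waveform.split(self.interval_samples, dim=-1))

        # 3. Create several types of spectrograms for each chunk
        #    (torchaudio transforms accept arbitrary leading batch dims).
        return {"mel": self.mel(chunks), "linear": self.linear(chunks)}
```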

If any of the tasks are independent of the audio file, you can perform them in the __init__ method. Otherwise, instead of doing everything in __getitem__, you can write custom transforms (see the sketch below) and apply them in the order you want. I am not really sure whether this is more efficient than doing everything in __getitem__, but you can try it and check.
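
A minimal sketch of what such custom transforms could look like; PadToMultiple and ChunkIntervals are made-up names, and the interval of 16000 samples is arbitrary:

```python
import torch
import torch.nn.functional as F

class PadToMultiple:
    """Pad the last (time) dimension up to a multiple of `interval` samples."""
    def __init__(self, interval):
        self.interval = interval

    def __call__(self, waveform):
        remainder = waveform.size(-1) % self.interval
        if remainder:
            waveform = F.pad(waveform, (0, self.interval - remainder))
        return waveform

class ChunkIntervals:
    """Split the (already padded) waveform into equal fixed-length chunks."""
    def __init__(self, interval):
        self.interval = interval

    def __call__(self, waveform):
        return torch.stack(waveform.split(self.interval, dim=-1))

# In the dataset: self.transforms = [PadToMultiple(16000), ChunkIntervals(16000)]
# In __getitem__:  for t in self.transforms: waveform = t(waveform)
```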

__getitem__ is what runs in the worker processes spawned by the DataLoader via multiprocessing.
The heavy workload (mainly the spectrograms) should live there, so it gets parallelized across workers.
As @hash-ir mentions, any other task can be performed in the __init__ method (or even outside the dataset class).
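
To make that concrete, a sketch of the loader side (the batch size and worker count are arbitrary):

```python
from torch.utils.data import DataLoader

# Each worker process calls dataset.__getitem__ independently, so the
# expensive per-item work (padding, chunking, spectrograms) runs in
# parallel on the CPU while the GPU trains on the previous batch.
loader = DataLoader(
    dataset,          # any Dataset whose __getitem__ does the heavy lifting
    batch_size=4,
    num_workers=4,    # 4 processes, each executing __getitem__
    pin_memory=True,  # faster host-to-GPU copies
)

for batch in loader:
    ...  # training step
```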

It’s all about how much RAM you have.
Do you need to read the audio in __getitem__?
Well, if your dataset fits in RAM, you can preload it in __init__.
You can use any Python function inside __getitem__, so it can still stay readable.
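
A sketch of that preloading variant (again with made-up names), assuming the decoded waveforms fit in memory:

```python
import torchaudio
from torch.utils.data import Dataset

class PreloadedAudioDataset(Dataset):
    """Decode every file once up front; __getitem__ then does no disk I/O."""

    def __init__(self, file_paths):
        # One-time cost at construction: all waveforms are held in RAM.
        self.waveforms = [torchaudio.load(p)[0] for p in file_paths]

    def __len__(self):
        return len(self.waveforms)

    def __getitem__(self, idx):
        return self.waveforms[idx]  # already in memory, nothing to load
```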
