What exactly should the collator do?

Suppose we have an audio classification task (AudioMNIST).

My pipeline, like other pipelines I’ve seen, consists of the following steps:

  1. Read the dataset (the data samples).
  2. Do the base transforms (merge the audio channels, change the bitrate, etc.).
  3. Split the dataset into train, test, etc. subsets.
  4. Do the main transforms (different for train and test), such as augmentation.
  5. Batch (along with the sampling).
  6. Pad/Truncate the batch samples.
  7. Do the forward pass with the batch.
  8. <…>

I have seen this scheme:

  • Dataset or a subclass - steps 1, 2, 3, 4.
  • Collator - step 6.

Or:

  • Dataset or a subclass - step 1.
  • something else - steps 2, 3, 4.
  • Collator - step 6.

Or:

  • Dataset or a subclass - step 1.
  • something else - step 3.
  • Collator - steps 2, 4, 6.

What should the collator do, and what shouldn’t it do? (This is the main question.)

What is the correct scheme? 🙂

Thanks!

@ptrblck, I would appreciate your answer 🙂

The collate function is set in the DataLoader (via its collate_fn argument) and is usually responsible for creating the actual batch from the samples returned by Dataset.__getitem__. __getitem__ would load, transform, and return a single sample, while collate_fn would then optionally pad the samples and stack them into a single batch.
Let me know if this fits your understanding or if more details are needed.
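
As a rough illustration of that split, here is a minimal sketch. `ToyAudioDataset` and `pad_collate` are hypothetical names made up for this example, and random noise stands in for real AudioMNIST recordings; only `Dataset`, `DataLoader`, and `pad_sequence` are actual PyTorch APIs.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class ToyAudioDataset(Dataset):
    """Hypothetical stand-in for AudioMNIST: mono waveforms of varying length."""

    def __init__(self, lengths, num_classes=10):
        self.lengths = list(lengths)
        self.num_classes = num_classes

    def __len__(self):
        return len(self.lengths)

    def __getitem__(self, idx):
        # Load and transform ONE sample (random noise instead of real audio).
        waveform = torch.randn(self.lengths[idx])
        label = idx % self.num_classes
        return waveform, label


def pad_collate(samples):
    """Zero-pad the waveforms to the longest one in the batch and stack them."""
    waveforms, labels = zip(*samples)
    lengths = torch.tensor([w.shape[0] for w in waveforms])
    batch = pad_sequence(list(waveforms), batch_first=True)  # (B, T_max)
    return batch, torch.tensor(labels), lengths


loader = DataLoader(
    ToyAudioDataset([8000, 6000, 7500, 5000]),
    batch_size=2,
    collate_fn=pad_collate,
)
batch, labels, lengths = next(iter(loader))
```

Returning the original lengths alongside the padded batch is just one possible design choice; it lets the model mask out the padding later if needed.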

@ptrblck, thanks! But what if I want to multiply the samples, so that one dataset sample turns into five augmented ones? Who should do that?

If you want to load a single sample and transform it multiple times, you could add these operations to __getitem__ and return all the resulting samples at once. This would of course increase your actual batch size, since by default collate_fn expects a single sample per __getitem__ call. Let me know if I misunderstood your use case or question.
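
One way this could look, as a sketch only: `MultiViewDataset`, its `augment` method, and `flatten_collate` are hypothetical names, the additive noise is a placeholder augmentation, and all waveforms are assumed to have the same length so that no padding is needed.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class MultiViewDataset(Dataset):
    """Hypothetical dataset: each index yields n_views augmented copies."""

    def __init__(self, waveforms, labels, n_views=5):
        self.waveforms = waveforms
        self.labels = labels
        self.n_views = n_views

    def __len__(self):
        return len(self.waveforms)

    def augment(self, waveform):
        # Placeholder augmentation: additive Gaussian noise.
        return waveform + 0.01 * torch.randn_like(waveform)

    def __getitem__(self, idx):
        views = torch.stack(
            [self.augment(self.waveforms[idx]) for _ in range(self.n_views)]
        )
        labels = torch.full((self.n_views,), self.labels[idx], dtype=torch.long)
        return views, labels  # shapes: (n_views, T) and (n_views,)


def flatten_collate(samples):
    """Concatenate the per-index groups into one flat batch."""
    views, labels = zip(*samples)
    return torch.cat(views, dim=0), torch.cat(labels, dim=0)


waveforms = [torch.zeros(100) for _ in range(4)]
dataset = MultiViewDataset(waveforms, labels=[0, 1, 2, 3])
loader = DataLoader(dataset, batch_size=2, collate_fn=flatten_collate)
x, y = next(iter(loader))
```

With batch_size=2 and five views per index, the model sees an effective batch of 10 samples, which is the batch-size increase mentioned above.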

Well, so I would have to teach a DataLoader subclass to call dataset.get_augmented_items(n: int) -> list[T_co] and… rearrange the batches… Understood.

Meanwhile, the collator’s behavior doesn’t change.