New noise generation in DataLoader

Hi,

I’m building a denoising pipeline with a map-style Dataset and wrapper composition. I’d like some guidance on the best way to get reproducible on-the-fly noise generation.

Current setup

  1. Base Dataset returns a dictionary of clean np.ndarray samples loaded from a folder of HDF5 files.
  2. Multiple wrappers are chained to rescale the data, add noise, and transpose (in NumPy).
  3. The DataLoader handles batching and shuffling, and uses a custom collate_fn to convert my data to a Tensor.

The dictionary returned from my base dataset object has the following structure

{
  "data": {
    "clean": <data>
  },
  "metadata": {...},
  "file_metadata": <metadata from the file the sample was taken from, to check correctness>
}

The data in question is multispectral time-series image data with the following shape per sample: (T, C, H, W), where T is the timestep and C the channel.
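
To make the structure concrete, here is a toy version of one sample and the kind of noise function I mean. This is only a sketch: the sizes, the `make_toy_sample` and `add_gaussian_noise` names, the "noisy" key, and the `sigma` parameter are placeholders for illustration, not my real code.

```python
import numpy as np

# Placeholder sizes: T timesteps, C channels, H x W pixels.
T, C, H, W = 4, 3, 8, 8


def make_toy_sample() -> dict:
    """Mimic the dictionary the base dataset returns (dummy values)."""
    return {
        "data": {"clean": np.zeros((T, C, H, W), dtype=np.float32)},
        "metadata": {},
        "file_metadata": {},
    }


def add_gaussian_noise(sample: dict, rng: np.random.Generator, sigma: float = 0.1) -> dict:
    """Return a new sample dict with a 'noisy' array next to 'clean'."""
    clean = sample["data"]["clean"]
    noise = rng.normal(0.0, sigma, size=clean.shape).astype(clean.dtype)
    out = dict(sample)  # shallow copy so the input dict is not modified
    out["data"] = {**sample["data"], "noisy": clean + noise}
    return out


sample = add_gaussian_noise(make_toy_sample(), np.random.default_rng(0))
print(sample["data"]["noisy"].shape)  # (4, 3, 8, 8)
```

The point is that the noise function itself holds no RNG state; whoever calls it decides which Generator to pass in.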

What I want

  1. Reproducibility across runs.
  2. Option to have:
    • same noise per sample every epoch
    • new noise each epoch, but reproducible based on a starting seed
  3. Stable results regardless of num_workers

Design questions

  1. Is it better to derive per-sample RNG inside __getitem__ using:
    • SeedSequence([base_seed, idx]) for fixed noise, or
    • SeedSequence([base_seed, epoch, idx]) for epoch-varying noise?
  2. Is passing an np.random.Generator into the noise function the right design, or is there a better pattern for PyTorch?
  3. With persistent_workers=True, how do I propagate epoch state to worker-side dataset copies?
  4. Any gotchas mixing DataLoader(generator=...) seeding with NumPy RNG seeding?
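
For reference, this is roughly the wrapper I have in mind for questions 1 and 3. It is a sketch, not my working pipeline code: `NoisyWrapper`, `set_epoch`, `vary_per_epoch`, and `sigma` are names I made up, and the class is a plain Python object with the map-style interface (`__len__`/`__getitem__`) so the snippet runs without torch.

```python
import numpy as np


class NoisyWrapper:
    """Wrap a map-style dataset and add noise derived from (seed, epoch, idx)."""

    def __init__(self, base, base_seed: int, vary_per_epoch: bool = False, sigma: float = 0.1):
        self.base = base
        self.base_seed = base_seed
        self.vary_per_epoch = vary_per_epoch
        self.sigma = sigma
        self.epoch = 0

    def set_epoch(self, epoch: int) -> None:
        # Only updates this copy of the dataset object.
        self.epoch = epoch

    def __len__(self) -> int:
        return len(self.base)

    def _rng(self, idx: int) -> np.random.Generator:
        # Fixed noise: [base_seed, idx]; epoch-varying: [base_seed, epoch, idx].
        if self.vary_per_epoch:
            entropy = [self.base_seed, self.epoch, idx]
        else:
            entropy = [self.base_seed, idx]
        return np.random.default_rng(np.random.SeedSequence(entropy))

    def __getitem__(self, idx: int) -> dict:
        sample = self.base[idx]
        clean = sample["data"]["clean"]
        noise = self._rng(idx).normal(0.0, self.sigma, size=clean.shape)
        out = dict(sample)  # shallow copy so the base sample is not modified
        out["data"] = {**sample["data"], "noisy": clean + noise.astype(clean.dtype)}
        return out


# Toy base dataset: a list of sample dicts with dummy data.
base = [{"data": {"clean": np.zeros((2, 3, 4, 4), dtype=np.float32)}} for _ in range(5)]
fixed = NoisyWrapper(base, base_seed=1234)
varying = NoisyWrapper(base, base_seed=1234, vary_per_epoch=True)
```

Because the RNG depends only on the seed, the epoch, and the index, the noise for a given sample should not depend on which worker loads it. In the training loop I would call `set_epoch(epoch)` before iterating the DataLoader. With `persistent_workers=True` the worker copies of the dataset are created once, so a call in the main process does not reach them, which is what question 3 is about.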

Hope to hear your thoughts. This is also my first time working with PyTorch in this way, so I’m open to feedback! If code snippets are needed, let me know.