Hi,
I’m building a denoising pipeline with a map-style Dataset and wrapper composition. I’d like some guidance on the best way to reliably get reproducible on-the-fly noise generation.
Current setup
- Base Dataset returns a dictionary of clean `np.ndarray` samples from a folder with HDF5 files.
- Multiple wrappers are chained to rescale the data, add noise, and transpose (in NumPy).
- The DataLoader does the batching and shuffling, with a custom `collate_fn` to transform my data to a Tensor.
The dictionary returned from my base dataset object has the following structure:
{
"data": {
"clean": <data>
},
"metadata": {...},
"file_metadata": <metadata from the file the sample was taken from, to check correctness>
}
The data in question is multispectral time-series image data with the following shape per sample: (T, C, H, W), where T = timestep and C = channel.
What I want
- Reproducibility across runs.
- Option to have either:
  - the same noise per sample every epoch, or
  - new noise each epoch, but reproducible from a starting seed
- Stable results, no matter the value of `num_workers`
Design questions
- Is it better to derive a per-sample RNG inside `__getitem__`, using `SeedSequence([base_seed, idx])` for fixed noise or `SeedSequence([base_seed, epoch, idx])` for epoch-varying noise?
- Is passing an `np.random.Generator` into the noise function the right design, or is there a better PyTorch pattern?
- With `persistent_workers=True`, how do I propagate epoch state to the worker-side dataset copies?
- Are there any gotchas when mixing `DataLoader(generator=...)` seeding with NumPy RNG seeding?
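To make the first two questions concrete, here is a minimal sketch of the pattern I have in mind. The names `make_rng`, `add_gaussian_noise`, `NoisyWrapper`, and the `sigma` parameter are placeholders for this post, not my actual code, and I'm assuming additive Gaussian noise purely for illustration. The `epoch` attribute is exactly the state I'm unsure how to keep in sync across workers.

```python
import numpy as np


def make_rng(base_seed, idx, epoch=None):
    """Derive a Generator from (base_seed, idx) or (base_seed, epoch, idx)."""
    entropy = [base_seed, idx] if epoch is None else [base_seed, epoch, idx]
    return np.random.default_rng(np.random.SeedSequence(entropy))


def add_gaussian_noise(clean, rng, sigma=0.1):
    """Placeholder noise function: additive Gaussian noise drawn from `rng`."""
    noise = rng.normal(0.0, sigma, size=clean.shape)
    return clean + noise.astype(clean.dtype)


class NoisyWrapper:
    """Placeholder wrapper around a map-style dataset returning the dict above."""

    def __init__(self, base, base_seed, per_epoch=False, sigma=0.1):
        self.base = base
        self.base_seed = base_seed
        self.per_epoch = per_epoch
        self.sigma = sigma
        self.epoch = 0  # assumption: something has to update this every epoch

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        sample = self.base[idx]
        # The RNG depends only on the seed, the index and (optionally) the
        # epoch, not on which worker happens to load the sample.
        epoch = self.epoch if self.per_epoch else None
        rng = make_rng(self.base_seed, idx, epoch)
        data = dict(sample["data"])  # shallow copy, keep "clean" untouched
        data["noisy"] = add_gaussian_noise(data["clean"], rng, self.sigma)
        return {**sample, "data": data}
```

With `per_epoch=False` the same index should always get the same noise; with `per_epoch=True` the noise changes only when `epoch` changes. Since nothing here depends on worker identity, my hope is that it is also independent of `num_workers`, but I'd appreciate confirmation.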
Hope to hear your thoughts. This is also my first time working with PyTorch in this way, so I'm open to feedback! If more code snippets are needed, let me know.