I would like to implement a custom
Dataset that has a feedback loop connected to my training code. Specifically, after each epoch I would like to return information about the training progress to the dataset, so that I can make changes within the dataset that will (hopefully) improve the training.
Now my question is: Is this possible? I know that at the start of training the PyTorch
DataLoader creates independent copies of my
Dataset object and passes them to each worker. So, as I understand it, changing something in the original dataset object I created before training would not work, as it does not alter the dataset copies in the worker processes. Is there some kind of workaround for this limitation, or is the "dataset-training loop" connection meant as a strict one-way street in PyTorch?
This is correct (if you are using multiple workers in the
DataLoader), but note that you should still be able to manipulate the underlying
loader.dataset after each epoch and before starting a new one, as long as you are not using persistent_workers=True.
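As a minimal sketch of this pattern: with persistent_workers left at its default (False), the workers are re-created at the start of every epoch and receive a fresh copy of the dataset, so updates made to loader.dataset between epochs are picked up. The dataset class and the set_difficulty method below are hypothetical, just to illustrate the feedback loop.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class AdaptiveDataset(Dataset):
    """Hypothetical dataset whose behavior can be tuned between epochs."""
    def __init__(self, data):
        self.data = data
        self.difficulty = 0.0  # example of mutable per-epoch state

    def set_difficulty(self, value):
        self.difficulty = value

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # In a real dataset, `difficulty` might control augmentation
        # strength, sampling weights, etc.
        return self.data[idx] * (1.0 + self.difficulty)

dataset = AdaptiveDataset(torch.arange(8, dtype=torch.float32))
# persistent_workers defaults to False: workers (and their dataset
# copies) are re-created each time a new epoch's iterator starts.
loader = DataLoader(dataset, batch_size=4, num_workers=2)

for epoch in range(3):
    for batch in loader:
        pass  # training step would go here
    # Mutate the dataset in the main process; the workers spawned for
    # the next epoch will copy this updated state.
    loader.dataset.set_difficulty(epoch * 0.1)
```

With persistent_workers=True the worker processes (and their dataset copies) survive across epochs, so changes made here would not reach them.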
Thank you for your response @ptrblck!
I will test your suggestion today. At the beginning of my current project, I immediately set
persistent_workers=True, as I assumed it would speed up execution. Do you have any experience with how large the performance penalty is between turning persistent workers on and off?
I don’t think a general statement would be useful, as the penalty depends on the latency of creating the workers and thus e.g. on the workload in
Dataset.__init__. In particular, if you are pre-loading data the penalty might be large; otherwise it should be negligible.
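Since the penalty is workload-dependent, the most reliable answer is to measure it for your own dataset. A rough timing harness could look like the following; the dataset is a stand-in, and the numbers will vary with your __init__ cost, machine, and worker count.

```python
import time
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Stand-in dataset; replace with your real one to get meaningful numbers."""
    def __init__(self):
        # Expensive pre-loading here would inflate the per-epoch
        # worker-creation cost when persistent_workers=False.
        self.data = torch.arange(1024, dtype=torch.float32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

def time_epochs(persistent, epochs=3):
    """Total wall time for `epochs` passes over the loader."""
    loader = DataLoader(ToyDataset(), batch_size=64,
                        num_workers=2, persistent_workers=persistent)
    start = time.perf_counter()
    for _ in range(epochs):
        for _ in loader:
            pass
    return time.perf_counter() - start

print(f"persistent_workers=False: {time_epochs(False):.2f}s")
print(f"persistent_workers=True:  {time_epochs(True):.2f}s")
```

The difference between the two timings is roughly (epochs - 1) times your worker startup cost, which is the quantity you are trading away when you disable persistent workers to regain the per-epoch dataset copy.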