Unable to load dataloader from .pt file


I saved my processed data loader as .pt file to avoid the loading/preprocessing time every time I want to train a model and so that I could compare different networks that had been trained on exactly the same images.

My loader is saved as train.pt, and when using torch.load("path/to/train.pt") I get the following error:

    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "~/home/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 702, in _legacy_load
    result = unpickler.load()

It also outputs a ModuleNotFoundError pointing to a folder that isn't referenced anywhere in the script.

Finally, I want to add that when I try to load the dataloader from a Python console, it does not raise any error. Any help?

PS: I'm using CUDA on a Unix cluster system.

What did you assign to train.pt and how did you load your data in your Dataset?
If you’ve loaded it lazily, you won’t be able to store the DataLoader including the loaded files, as the DataLoader does not store the data internally.
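One common workaround along these lines (a sketch, not from the thread; the tensor shapes and file name are invented for illustration) is to save the preprocessed tensors themselves once, then rebuild a fresh DataLoader from a TensorDataset at training time:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for the real preprocessed data (shapes are made up for illustration)
images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, 10, (8,))

# Save the tensors once, not the DataLoader itself
torch.save({"images": images, "labels": labels}, "train_tensors.pt")

# Later (or in another script): reload and wrap in a fresh DataLoader
data = torch.load("train_tensors.pt")
dataset = TensorDataset(data["images"], data["labels"])
loader = DataLoader(dataset, batch_size=4, shuffle=True)

for x, y in loader:
    print(x.shape, y.shape)  # torch.Size([4, 3, 32, 32]) torch.Size([4])
```

Since the DataLoader itself is cheap to construct, only the (expensive) preprocessed tensors need to be persisted.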


Hello ptrblck, thanks for the quick reply. I created a custom Dataset class (similar to the one from this tutorial).

I then store the dataloader by doing:

dataloader = DataLoader(dataset, batch_size=batch_size,
                        shuffle=True, num_workers=cores, drop_last=True)
if save:
    torch.save(dataloader, os.getcwd() + "/labelled/" + savename)

If I cannot save my dataloader this way, how could I achieve my aim? (storing the transformed images to avoid the time it takes to transform + for reproducibility)


If you are lazily loading and transforming the data, you could save each sample in your __getitem__ method.
However, I don’t think this is a good idea, as you will just store the same number of samples after preprocessing and transforming them.
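A minimal sketch of that per-sample caching idea; the class name, cache path, and transform are assumptions for illustration, not the poster's actual Dataset:

```python
import os
import torch
from torch.utils.data import Dataset

class CachedDataset(Dataset):
    """Loads and transforms lazily, but saves each processed sample to disk once."""
    def __init__(self, raw_samples, transform=None, cache_dir="cache"):
        self.raw_samples = raw_samples
        self.transform = transform
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def __len__(self):
        return len(self.raw_samples)

    def __getitem__(self, idx):
        cache_file = os.path.join(self.cache_dir, f"{idx}.pt")
        if os.path.exists(cache_file):
            # Reuse the stored, already-transformed sample
            return torch.load(cache_file)
        sample = self.raw_samples[idx]
        if self.transform is not None:
            sample = self.transform(sample)
        # Store the processed sample once for later runs
        torch.save(sample, cache_file)
        return sample
```

On the second and subsequent epochs (or runs), `__getitem__` skips the transform entirely and returns the cached tensor.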

If your workload is large enough (i.e. if your model is not a small one and the GPU needs some time for the training), your data loading and processing might be fast enough to provide batches in the background.
Are you seeing a data loading bottleneck at the moment?

My biggest concern was the lack of reproducibility that could arise from each image being randomly cropped. I will now be centre cropping my images instead to work around this.
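For reference, seeding the random number generator also restores reproducibility without giving up random crops; the helper below is a hypothetical stand-in for torchvision's RandomCrop, written in plain PyTorch:

```python
import torch

def random_crop(img, size, generator):
    """Minimal random crop (illustrative stand-in for torchvision's RandomCrop)."""
    _, h, w = img.shape
    top = torch.randint(0, h - size + 1, (1,), generator=generator).item()
    left = torch.randint(0, w - size + 1, (1,), generator=generator).item()
    return img[:, top:top + size, left:left + size]

img = torch.randn(3, 32, 32)

# Seeding the generator makes the "random" crop identical across runs,
# so different networks still see exactly the same crops
g = torch.Generator().manual_seed(0)
crop_a = random_crop(img, 24, g)
g = torch.Generator().manual_seed(0)
crop_b = random_crop(img, 24, g)
print(torch.equal(crop_a, crop_b))  # True
```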

You are right, my workload is quite large and, comparatively, the loading is not that bad (maybe 5% of total running time). Thanks for your advice, ptrblck!
