Hey guys, I have a big dataset composed of huge images that I’m passing through a resizing and transformation pipeline.
I would like to save a copy of the images once they have passed through the DataLoader, in order to have a lighter version of the dataset. I haven’t been able to find much on Google. Can anyone guide me through this?
You could save each sample using torch.save, if you would like to store the tensors directly.
Note that random data augmentation methods are applied with random parameters on the fly in your Dataset. If you store these augmented data samples, the transformations will be static after reloading.
Could you explain your use case a bit and why you would like to store these samples?
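A minimal sketch of that approach; the tensor, target, and path here are just placeholders standing in for one of your transformed samples:

```python
import os
import tempfile

import torch

# Save one transformed sample as a (tensor, target) tuple and load it back.
img = torch.rand(3, 224, 224)  # stands in for your transformed image tensor
target = 5                     # its class index
path = os.path.join(tempfile.mkdtemp(), "sample_0.pt")
torch.save((img, target), path)

# Reloading gives back exactly the saved tensor and target.
img_loaded, target_loaded = torch.load(path)
```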
Yes, so what I’m trying to do is train a ResNet on a Kaggle dataset. Unfortunately I don’t have the GPUs for it, so I have to rely on Google Colab.
As you may know, Google Colab resets every 12 hours, and in my case one epoch takes around 1.5 hours.
The setup (downloading and unzipping the Kaggle dataset) takes around 3 hours, so by the end of the 12 hours I’m barely able to get to 8 epochs. What I did was back up the weights after every epoch so that I can reload them once it resets.
Since the Kaggle pictures are enormous, I’m trying to save a copy post-transformation in order to be able to load them faster after the next reset.
That would work using my approach, but as I said, you would be stuck with these “static” transformations/augmentations.
I’m not really that familiar with Colab, but wouldn’t it be possible to download and unzip the dataset once and store it in your Google Drive? Then:

1. Create a new script using your current ImageFolder approach and write a single loop over the complete training and validation datasets to store each element in your drive (in the corresponding train/val folder).
2. Write a custom Dataset to load the tensors directly instead of the images via ImageFolder.
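A rough sketch of both steps, assuming each sample was saved as a (tensor, target) tuple with torch.save; the `TensorFolder` class name and file layout are my own invention, not an existing API:

```python
import glob
import os
import tempfile

import torch
from torch.utils.data import Dataset


class TensorFolder(Dataset):
    """Loads samples previously saved as (image_tensor, target) tuples."""

    def __init__(self, root):
        self.paths = sorted(glob.glob(os.path.join(root, "*.pt")))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        img, target = torch.load(self.paths[index])
        return img, target


# Step 1: a single loop over the original dataset, saving each transformed
# sample. A tiny fake dataset stands in for your ImageFolder here.
out_dir = tempfile.mkdtemp()  # replace with your Drive train/val folder
fake_dataset = [(torch.rand(3, 224, 224), i % 2) for i in range(4)]
for idx, (img, target) in enumerate(fake_dataset):
    torch.save((img, target), os.path.join(out_dir, f"sample_{idx:06d}.pt"))

# Step 2: load the tensors directly, skipping JPEG decoding and resizing.
tensor_dataset = TensorFolder(out_dir)
```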
I need to save the transformed images for better efficiency too. I tried saving the transformed tensors with torch.save, but a [3, 224, 224] image tensor takes about 100 MB? That seems unreasonable. Why?
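For comparison, a back-of-envelope check of what a [3, 224, 224] tensor should occupy, assuming float32 and no extra storage attached:

```python
import torch

# A float32 [3, 224, 224] tensor: 4 bytes per element.
t = torch.rand(3, 224, 224)
size_bytes = t.element_size() * t.nelement()
print(size_bytes)  # 602112 bytes, i.e. roughly 0.6 MB, nowhere near 100 MB
```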
I am handling an image dataset with 100k images. I have tried saving the transformed image tensors as .jpg and .png images, and they all look good. Then I use Image.open(imageFile).convert('RGB') and transforms.ToTensor() to read the saved images, but I can’t get back the correct tensor. It’s different from the original transformed image tensor. How did that happen, and how can I get the correct tensor?