Hey guys, I have a big dataset composed of huge images that I’m passing through a resizing and transformation process.
I would like to save a copy of the images once they pass through the DataLoader in order to have a lighter version of the dataset. I haven’t been able to find much on Google. Can anyone guide me through this?
You could save each sample using torch.save, if you would like to save the tensors directly.
Note that random data augmentation methods are applied with random parameters on the fly in your Dataset. If you store these augmented samples, the transformations will be static after reloading.
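A minimal sketch of what "static" means here, using a hand-rolled flip in place of transforms.RandomHorizontalFlip so that plain torch suffices (random_hflip is a hypothetical stand-in, not a torchvision API):

```python
import torch

def random_hflip(x, p=0.5):
    # Hypothetical stand-in for transforms.RandomHorizontalFlip:
    # flips the last (width) dimension with probability p.
    return torch.flip(x, dims=[-1]) if torch.rand(1).item() < p else x

img = torch.arange(6.).reshape(1, 2, 3)  # tiny fake image, shape [C, H, W]

# Applied on the fly, the same sample can come out differently each epoch ...
epoch1 = random_hflip(img)
epoch2 = random_hflip(img)

# ... but once one augmented copy is saved, every reload yields exactly that version.
torch.save(epoch1, 'sample0.pt')
reloaded = torch.load('sample0.pt')
assert torch.equal(reloaded, epoch1)
```

So by caching the Dataset output you trade per-epoch variability for loading speed.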
Could you explain your use case a bit and why you would like to store these samples?
Yes, so what I’m trying to do is train a ResNet on a Kaggle dataset. Unfortunately I don’t have the GPUs for it, so I have to rely on Google Colab.
As you may know, Google Colab resets every 12 hours and, in my case, one epoch takes around 1.5 hours.
The setup (downloading and unzipping the Kaggle dataset) takes around 3 hours, so by the end of the 12 hours I’m barely able to get through 8 epochs. What I did was back up the weights after every epoch so I can reload them once it resets.
Since the Kaggle pictures are amazingly large, I’m trying to save a copy post-transformation in order to be able to load them faster at the next reset.
That would work using my approach, but as I said, you would be stuck with these “static” transformations/augmentations.
I’m not really that familiar with Colab, but wouldn’t it be possible to download and unzip the dataset once and store it in your Google Drive?
Your transformation does not include any random transforms, so it should be alright.
If your approach is working, you could also add the random augmentation later when reloading the data.
PS: In which format is the data stored at the moment? If Kaggle is using an image format with compression (e.g. JPEG), your current approach might take more memory than the original dataset.
I’m not sure, since you said 80GB would blow up your Google Drive.
Where would you like to store the tensors now, if they take more memory than their compressed JPEG equivalents?
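A back-of-envelope comparison makes the size concern concrete (the 224×224 resolution here is just an example):

```python
import torch

# A resized RGB image stored as a float32 tensor (the usual output of ToTensor()):
t_float = torch.zeros(3, 224, 224, dtype=torch.float32)
print(t_float.nelement() * t_float.element_size())  # 602112 bytes ≈ 0.57 MB

# The same image kept as uint8 is 4x smaller:
t_uint8 = torch.zeros(3, 224, 224, dtype=torch.uint8)
print(t_uint8.nelement() * t_uint8.element_size())  # 150528 bytes ≈ 0.14 MB
```

A JPEG of the same image is often only tens of kilobytes, so saving uncompressed tensors can easily exceed the size of the original compressed dataset.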
You could create a loop using the Dataset (or DataLoader) and just use torch.save inside.
I’m not sure how to access your Google Drive as a file path, but you most likely know it already.
dataset = MyDataset(..., transform=transform)
for idx, (data, target) in enumerate(dataset):
    torch.save(data, 'data_drive_path{}'.format(idx))
    torch.save(target, 'target_drive_path{}'.format(idx))
This will run only once (and take some time of course).
After executing this loop, you would write a new Dataset to load these tensors instead of the JPEG images for the training case.
1. Create a new script using your current ImageFolder approach and write a single loop over the complete training and validation datasets to store each element in your Drive (in the corresponding train/val folder).
2. Write a custom Dataset to load the tensors directly instead of the images via ImageFolder.
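A minimal sketch of step 2, assuming the files were written with the naming pattern from the save loop above ('data_drive_path{idx}' / 'target_drive_path{idx}'); the class name, the explicit length argument, and the filename pattern are all assumptions to adapt to your setup:

```python
import torch
from torch.utils.data import Dataset

class TensorFileDataset(Dataset):
    """Loads pre-transformed tensors saved via torch.save instead of decoding JPEGs.

    The filename pattern and the explicit `length` are assumptions -
    adapt them to however you stored your tensors.
    """
    def __init__(self, length, transform=None):
        self.length = length
        self.transform = transform  # optional random augmentation, applied on the fly

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        data = torch.load('data_drive_path{}'.format(idx))
        target = torch.load('target_drive_path{}'.format(idx))
        if self.transform is not None:
            data = self.transform(data)
        return data, target
```

This Dataset can be wrapped in a DataLoader as usual, and passing a random transform here re-introduces augmentation on top of the already-resized tensors.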
I need to save the transformed images for better efficiency too. I tried saving the transformed tensors with torch.save, but a [3, 224, 224] image tensor takes about 100M of memory? That seems irrational. Why?
I am handling an image dataset with 100k images. I have tried saving the transformed image tensors as .jpg and .png images, and they all look good. Then I use Image.open(imageFile).convert('RGB') and transforms.ToTensor() to read the saved images back, but I can’t get the correct tensor. It’s different from the original transformed image tensor. How did that happen, and how can I get the correct tensor?