Loading pickle files with the PyTorch DataLoader

Hello Everyone,

I am using the intermediate output of a pretrained CNN model as input to my model. The input to the pretrained CNN model is a color image. I don’t want to compute the intermediate output every time, so I have saved the intermediate output (60x256x45x80) in pickle format (.pt) by calling torch.save on it. This gives me a bunch of pickle files.

Now I want to load these pickle files (230 in total) directly with the PyTorch DataLoader and use them as input to train my model. I use the code below.
import os

import torch
from torch.utils import data


class pickle_Dataloader(data.Dataset):

    def __init__(self, root='.', iext='.pt', transforms=None):
        self.transforms = transforms
        self.image_list = []
        arr = os.listdir(root)

        # Load every file up front and collect all samples in one list.
        for i in range(len(arr)):
            filename = root + '/' + arr[i]
            with open(filename, 'rb') as fp:
                item = torch.load(fp)
            self.image_list += item
        self.size = len(self.image_list)

    def __getitem__(self, index):
        # index = index % self.size
        images = self.image_list[index]

        return images

    def __len__(self):
        return self.size


# BATCH_SIZE is defined elsewhere in the training script.
train_data_pkl = pickle_Dataloader(root='.')
Trainloader_pkl = torch.utils.data.DataLoader(train_data_pkl, shuffle=True, batch_size=BATCH_SIZE, num_workers=4, drop_last=False)


Is this the correct way to load the pickle files with the DataLoader?

I get the following RuntimeError: “CUDA out of memory. Tried to allocate 212.00 MiB (GPU 0; 10.92 GiB total capacity; 10.09 GiB already allocated; 9.44 MiB free; 10.14 GiB reserved in total by PyTorch)”

I tried varying the batch size, but I get the same error.

I am using Python 3 and torch version 1.9.0+cu102 on Linux.

Can you please tell me where I am going wrong?

Thank you

It would probably be better to name your custom dataset pickle_Dataset: Dataset and DataLoader are two separate things, and the current name makes the question confusing.

Notice that in the dataset class you’ve written, you’re loading all the data in the constructor.
Try loading a sample in the __getitem__ method instead
(create a list of all file names using os.listdir in __init__, then load the file in __getitem__ by indexing that list).
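
A minimal sketch of that suggestion follows. The class name LazyPickleDataset and the samples_per_file parameter are hypothetical, and the sketch assumes every .pt file holds one tensor with the same number of samples along its first dimension (60 in the question). Because the error in the question is a CUDA error, the saved tensors may live on the GPU; torch.load restores tensors to the device they were saved from by default, so the sketch passes map_location="cpu".

```python
import os

import torch
from torch.utils import data


class LazyPickleDataset(data.Dataset):
    """Keeps only file paths in memory; tensors are loaded on demand."""

    def __init__(self, root=".", iext=".pt", samples_per_file=60):
        # Sorted so the index -> file mapping is deterministic.
        self.paths = sorted(
            os.path.join(root, name)
            for name in os.listdir(root)
            if name.endswith(iext)
        )
        # Assumption: every file stores the same number of samples.
        self.samples_per_file = samples_per_file

    def __len__(self):
        return len(self.paths) * self.samples_per_file

    def __getitem__(self, index):
        # Translate the flat index into (which file, which row in it).
        file_idx, sample_idx = divmod(index, self.samples_per_file)
        # map_location="cpu" keeps the loaded tensor off the GPU, even if
        # it was saved from a CUDA tensor.
        chunk = torch.load(self.paths[file_idx], map_location="cpu")
        return chunk[sample_idx]
```

Note that this reads a whole file for every sample it returns, which is slow for large files. Saving one sample per file beforehand, or returning the whole file as one item, would avoid the repeated reads.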

How can I avoid loading the whole pickle file at once within the dataloader? Can it be done with a pointer?

How can I load just one sample or batch after another from the pickle file?
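
One way to read a pickle file sample by sample is to pickle each sample separately into the same file and remember the byte offset where each one starts; the offset then acts as the "pointer". The sketch below uses only the standard pickle module, and the helper names write_samples and read_sample are hypothetical. It assumes you write the data yourself with pickle.dump; files produced by torch.save use their own container format, so this is not a drop-in reader for existing .pt files.

```python
import pickle


def write_samples(path, samples):
    """Pickle each sample separately into one file; return byte offsets."""
    offsets = []
    with open(path, "wb") as fp:
        for sample in samples:
            offsets.append(fp.tell())  # position where this sample starts
            pickle.dump(sample, fp)
    return offsets


def read_sample(path, offsets, index):
    """Seek to one sample and unpickle only that object."""
    with open(path, "rb") as fp:
        fp.seek(offsets[index])
        return pickle.load(fp)
```

Inside a Dataset, __getitem__ could call read_sample with the requested index, so only one sample is deserialized per access. The offsets list itself is small and can be stored next to the data file.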

My second question is whether I can save the data differently, not as pickle. Would that make loading easier?
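
One possible alternative, sketched here under the assumption that the features fit a single fixed-shape array, is a plain NumPy .npy file opened as a memory map. With mmap_mode="r", np.load does not read the whole array into RAM; data is read from disk when a slice is accessed. The helper names save_features, open_features and get_sample are hypothetical.

```python
import numpy as np


def save_features(path, array):
    """Store features as a plain .npy file instead of a pickle."""
    np.save(path, array)


def open_features(path):
    """Memory-map the file; slices are read from disk only when accessed."""
    return np.load(path, mmap_mode="r")


def get_sample(features, index):
    """Copy a single sample into memory as a regular ndarray."""
    return np.array(features[index])
```

In a Dataset, __getitem__ could return torch.from_numpy(get_sample(features, index)), so each access touches only the bytes of that one sample.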

Similar thread: What is the best way to load a large numpy array as PyTorch dataset?