Load different sections of data at every epoch using data loader

I ran a pretrained model on a bunch of data and stored the result paths in a CSV file. The CSV file currently contains the following columns: epoch, index, input_data, result. Now I want to train my main model using these results from the pretrained model, for 500 epochs. This means I need to retrieve the results for a particular epoch from the CSV file at every epoch of my training (so for epoch 1, I will get all rows where epoch == 1 from the CSV file and pass them through a dataloader, and so on for all 500 epochs). Is there a way of doing this using a dataloader?

You can achieve this by creating a custom dataset class that inherits from torch.utils.data.Dataset and implements __len__ and __getitem__. Inside this custom dataset, you can filter the data down to the specific epoch you're interested in. Then you can use a DataLoader to load that data in batches for training.

# Create a custom dataset class
import pandas as pd
import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, csv_file, epoch, transform=None):
        # Read the full CSV, then keep only the rows for the requested epoch
        self.data = pd.read_csv(csv_file)
        self.epoch_data = self.data[self.data['epoch'] == epoch].reset_index(drop=True)
        self.transform = transform

    def __len__(self):
        return len(self.epoch_data)

    def __getitem__(self, idx):
        # DataLoader samplers may hand over tensor indices; convert to plain ints
        if torch.is_tensor(idx):
            idx = idx.tolist()

        input_data = self.epoch_data.iloc[idx]['input_data']
        result = self.epoch_data.iloc[idx]['result']

        if self.transform:
            input_data = self.transform(input_data)

        return input_data, result

# Create a DataLoader for each epoch and use it in your training loop
from torch.utils.data import DataLoader

csv_file = 'your_csv_file.csv'
num_epochs = 500
batch_size = 64

for epoch in range(1, num_epochs + 1):
    print(f"Training for epoch {epoch}")
    
    # Create the custom dataset for the current epoch
    dataset = CustomDataset(csv_file, epoch)
    
    # Create a DataLoader to load the data in batches
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=0)
    
    # Training loop
    for i, (input_data, result) in enumerate(dataloader):
        # Perform your training step here (see the sketch below)
        pass

The code above defines a custom dataset class that reads the data from a CSV file and filters it to a specific epoch. In the main training loop, a new DataLoader is created for each epoch using the custom dataset. This way, you load a different section of the data for each epoch, as required.
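
For reference, the pass placeholder above is where a standard training step would go. A minimal sketch, assuming a hypothetical model, criterion, and optimizer, and assuming input_data and result have already been converted to tensors of the right shape:

import torch.nn as nn
import torch.optim as optim

# Hypothetical model, loss, and optimizer -- replace with your own
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

for i, (input_data, result) in enumerate(dataloader):
    optimizer.zero_grad()              # clear gradients from the previous step
    output = model(input_data)         # forward pass
    loss = criterion(output, result)   # compare against the pretrained model's result
    loss.backward()                    # backpropagate
    optimizer.step()                   # update parameters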

But remember that you should preprocess input_data and result as needed to convert them into the appropriate format (e.g., tensors) for your model.
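
For example, if input_data is a file path to a tensor saved with torch.save (an assumption here; the same idea applies to the result column if it also stores paths), the transform argument can do the loading on the fly:

import torch

def load_input(path):
    # Assumes the CSV cell holds a path to a tensor saved with torch.save;
    # adapt this to however your data is actually stored
    return torch.load(path)

dataset = CustomDataset(csv_file, epoch, transform=load_input)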

This makes sense. Thanks a lot! Will the overall operation be computationally expensive, though, seeing as I'm loading data at every epoch?
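
Re-reading the entire CSV from disk at every epoch (the pd.read_csv call inside __init__) does add overhead, especially over 500 epochs. A common way to reduce it is to read the file once before the training loop and have the dataset filter an in-memory DataFrame instead. A minimal sketch of that variation, keeping the same interface as CustomDataset above:

import pandas as pd
from torch.utils.data import DataLoader, Dataset

class PreloadedDataset(Dataset):
    def __init__(self, dataframe, epoch, transform=None):
        # Filter an already-loaded DataFrame instead of re-reading the file
        self.epoch_data = dataframe[dataframe['epoch'] == epoch].reset_index(drop=True)
        self.transform = transform

    def __len__(self):
        return len(self.epoch_data)

    def __getitem__(self, idx):
        input_data = self.epoch_data.iloc[idx]['input_data']
        result = self.epoch_data.iloc[idx]['result']
        if self.transform:
            input_data = self.transform(input_data)
        return input_data, result

full_data = pd.read_csv(csv_file)  # one disk read for all 500 epochs

for epoch in range(1, num_epochs + 1):
    dataset = PreloadedDataset(full_data, epoch)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=0)
    # ... training step as before ...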