How to load data from a .csv

Jordan_Howell · October 15, 2019, 6:59pm

I have a CSV that contains a column of image file names, target labels, and the location of each file. I am coming from the TensorFlow/Keras side and am admittedly a newbie to PyTorch. How do I code a DataLoader that reads the CSV, pulls the images, randomly splits off a test set, and finally gives me a train set and a test set to pull from in batches?

The CSV columns are as follows:
location: directory of where the image sits (includes image file name)
file name: image file name
target: target variable consisting of 1 or 0

Jordan_Howell · October 15, 2019, 7:47pm

I tried the following, based on https://jhui.github.io/2018/02/09/PyTorch-Data-loading-preprocess_torchvision/, but got an error. The code is below:

class Roof_Dataset(Dataset):
    """Roof dataset."""

    def __init__(self, csv_file, root_dir, transform):
        """
        Args:
            csv_file (string): Path to the csv file with annotations.
            root_dir (string): Directory with all the images.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        self.roofs_frame = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.roofs_frame)

    def __getitem__(self, idx):
        img_name = Path(self.root_dir,
                        self.roofs_frame.iloc[idx, 1])
        image = io.imread(img_name)
        landmarks = self.roofs_frame.iloc[idx, 5]
        sample = {'image': image, 'landmarks': landmarks}

        if self.transform:
            sample = self.transform(sample)

        return sample

I get the following:

NameError Traceback (most recent call last)
in
----> 1 class Roof_Dataset(Dataset):
2 """Roof dataset."""
3
4 def __init__(self, csv_file, root_dir, transform):
5 """

NameError: name 'Dataset' is not defined

I’m also not sure how I would have the data loader randomly cut off 20% for a test set. Am I headed in the wrong direction?

JamesTrick · October 15, 2019, 8:18pm

Hi! Have you imported Dataset? You can do this by adding: from torch.utils.data import Dataset.
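For reference, here is a minimal sketch of what the dataset could look like with the import in place. It is not the original poster's final code: the class name `RoofDataset`, the choice to return the `target` column instead of the tutorial's landmarks, and the temporary demo files are all assumptions made for illustration. The column names follow the CSV layout described in the first post.

```python
import tempfile
from pathlib import Path

import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset  # the import missing in the question


class RoofDataset(Dataset):
    """Hypothetical dataset backed by a CSV with 'location' and 'target' columns."""

    def __init__(self, csv_file, transform=None):
        self.roofs_frame = pd.read_csv(csv_file)
        self.transform = transform

    def __len__(self):
        return len(self.roofs_frame)

    def __getitem__(self, idx):
        row = self.roofs_frame.iloc[idx]
        # 'location' already includes the file name, so no root_dir is needed here.
        image = Image.open(row["location"]).convert("RGB")
        target = torch.tensor(int(row["target"]))
        if self.transform is not None:
            image = self.transform(image)
        return image, target


# Tiny demo: two blank images and a matching CSV written to a temp directory.
tmp_dir = Path(tempfile.mkdtemp())
rows = []
for i, label in enumerate([0, 1]):
    path = tmp_dir / f"roof_{i}.png"
    Image.new("RGB", (8, 8)).save(path)
    rows.append({"location": str(path), "file name": path.name, "target": label})
csv_path = tmp_dir / "roofs.csv"
pd.DataFrame(rows).to_csv(csv_path, index=False)

dataset = RoofDataset(csv_path)
image, target = dataset[1]
print(len(dataset), image.size, target)
```

In real use you would pass a torchvision transform ending in `ToTensor()` so that the DataLoader can stack the images into batches.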

In terms of splitting off a validation set, you’ll need to do this outside the dataset. It’s probably easiest to use sklearn’s train_test_split. For example:

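import pandas as pd
from sklearn.model_selection import train_test_split

# train_test_split needs the data itself, not the file path, so read the csv first
full = pd.read_csv("full.csv")
train, val = train_test_split(full, test_size=0.2)
# index=False keeps the column positions identical to the original csv
train.to_csv("train.csv", index=False)
val.to_csv("val.csv", index=False)

train_dataset = Roof_Dataset(csv_file="train.csv")  # Add any other params such as transforms here
val_dataset = Roof_Dataset(csv_file="val.csv")  # Again add any other params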
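If you would rather not write two extra CSV files, another option is to split the dataset object itself with `torch.utils.data.random_split`. The sketch below is only an illustration: it uses a `TensorDataset` of random numbers as a stand-in for the image dataset, and the 80/20 ratio, seed, and batch size are assumed values.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in for the image dataset: 100 fake samples with binary targets.
full_dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))

# Hold out 20% of the samples for testing.
n_test = int(0.2 * len(full_dataset))
n_train = len(full_dataset) - n_test

# A seeded generator makes the random split reproducible.
generator = torch.Generator().manual_seed(0)
train_dataset, test_dataset = random_split(
    full_dataset, [n_train, n_test], generator=generator
)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

features, targets = next(iter(train_loader))
print(len(train_dataset), len(test_dataset), features.shape)
```

One caveat with this approach: both subsets share the transform of the underlying dataset, so it fits best when train and test use the same preprocessing.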

avinash08 · February 20, 2020, 5:19pm

I am facing the same kind of difficulty. My CSV file is different from the image CSV: it contains numerical data that will be fed to an RNN model, but I can’t find any documentation on how to do that.

As a newbie, I ask for your help. Thanks in advance.
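For the purely numerical case, one possible starting point is to read the CSV with pandas, convert the columns to tensors, and serve fixed-length windows of rows to the RNN. Everything specific in this sketch is an assumption: the class name `SequenceCSVDataset`, the column names `f1`, `f2`, `y`, the window length, and the idea that consecutive rows form a time series.

```python
import io

import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset


class SequenceCSVDataset(Dataset):
    """Sliding windows over the rows of a numeric dataframe (hypothetical layout)."""

    def __init__(self, frame, feature_cols, target_col, seq_len):
        self.features = torch.tensor(frame[feature_cols].values, dtype=torch.float32)
        self.targets = torch.tensor(frame[target_col].values, dtype=torch.float32)
        self.seq_len = seq_len

    def __len__(self):
        # Number of full windows that fit into the data.
        return len(self.features) - self.seq_len + 1

    def __getitem__(self, idx):
        window = self.features[idx : idx + self.seq_len]
        # Use the target of the last row in the window.
        target = self.targets[idx + self.seq_len - 1]
        return window, target


# In-memory CSV so the example runs without any files on disk.
csv_text = "f1,f2,y\n" + "\n".join(f"{i},{i * 2},{i % 2}" for i in range(10))
frame = pd.read_csv(io.StringIO(csv_text))

dataset = SequenceCSVDataset(frame, ["f1", "f2"], "y", seq_len=4)
loader = DataLoader(dataset, batch_size=2, shuffle=False)
windows, targets = next(iter(loader))

# batch_first=True matches the (batch, seq_len, features) layout of the windows.
rnn = torch.nn.RNN(input_size=2, hidden_size=8, batch_first=True)
output, hidden = rnn(windows)
print(windows.shape, output.shape)
```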

Jordan_Howell · February 22, 2020, 7:34pm

Hello @avinash08. I was out of town. When I get back to my desk on Monday, I’ll reply with what worked.

Jordan_Howell · February 24, 2020, 10:27am

I ended up combining everything into one CSV, with the location of my images as a column in my tabular data. I have multiple images per record, which means that in the end I take the mean prediction over the multiple images of one observation. Below is the code that worked to pull the data.

class image_Dataset(Dataset):
    '''
    image class data set
    '''
    def __init__(self, data, transform=None):
        '''
        Args:
        ------------------------------------------------------------
            data = dataframe
            image = column in dataframe with absolute path to the image
            label = column in dataframe that is the target classification variable
            numerical_columns =  numerical columns from data
            categorical_columns = categorical columns from data
            policy = ID variable
        '''
        self.image_frame = data
        self.transform = transform

    def __len__(self):
        return len(self.image_frame)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        label = self.image_frame.loc[idx, 'target']
        pic = Path(self.image_frame.loc[idx, 'location'])
        img = Image.open(pic)
        policy = self.image_frame.loc[idx, 'policy']

        # numerical_columns and categorical_columns are lists of column names
        # that are expected to be defined at module level
        numerical_data = self.image_frame.loc[idx, numerical_columns]
        numerical_data = torch.tensor(numerical_data, dtype=torch.float)

        # fall back to the untransformed image so `image` is always defined
        image = img
        if self.transform:
            image = self.transform(img)

        # encode each categorical column as integer codes
        for category in categorical_columns:
            self.image_frame[category] = self.image_frame[category].astype('category').cat.codes.values

        categorical_data = self.image_frame.loc[idx, categorical_columns]
        categorical_data = torch.tensor(categorical_data, dtype=torch.int64)

        return image, label, policy, categorical_data, numerical_data
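The "mean prediction over multiple images" step mentioned above is not shown in the post. One way it could be done, assuming the per-image predictions are collected in a dataframe next to the `policy` ID, is a pandas `groupby`. The column name `prob` and the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical per-image predictions; each policy has several images.
predictions = pd.DataFrame(
    {
        "policy": ["A", "A", "B", "B", "B"],
        "prob": [0.2, 0.4, 0.9, 0.7, 0.8],
    }
)

# One averaged prediction per policy (i.e. per observation).
mean_per_policy = predictions.groupby("policy")["prob"].mean()
print(mean_per_policy)
```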

avinash08 · February 24, 2020, 12:13pm

Thanks a lot. I was stuck for days on just loading the CSV into the model.

Arnaud_Mal · April 21, 2020, 10:43am

Hello,

I have a similar problem (link) and I created a discussion for it.

I was able to create a CustomDataset that returns an image and a label (both tensors). I then pass it to the DataLoader, but when I get the image and target from the DataLoader during backpropagation, the size is not right.

The CustomDataset is:

class CustomDataset(Dataset):
    def __init__(self, csv_file, id_col, target_col, root_dir, sufix=None, transform=None):
        """
        Args:
            csv_file   (string):             Path to the csv file with annotations.
            root_dir   (string):             Directory with all the images.
            id_col     (string):             csv id column name.
            target_col (string):             csv target column name.
            sufix      (string, optional):   Optional sufix for samples.
            transform  (callable, optional): Optional transform to be applied on a sample.
        """
        self.data      = pd.read_csv(csv_file)
        self.id        = id_col
        self.target    = target_col
        self.root      = root_dir
        self.sufix     = sufix
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # get the image name at the different idx
        img_name = self.data.loc[idx, self.id]
        
        # if there is no sufix, nothing happens; in this case sufix is '.jpg'
        if self.sufix is not None:
            img_name = img_name + self.sufix
        
        # it opens the image of the img_name at the specific idx
        image = Image.open(os.path.join(self.root, img_name))
        
        # if there is no transform, nothing happens; two transforms (train and test) are defined below
        if self.transform is not None:
            image = self.transform(image)
        
        # define the label based on the idx
        #label = self.data.loc[idx, self.target].values
        #label = torch.from_numpy(label.astype(np.int8))
        #label = label.squeeze(-1)
        
        #Test second option
        
        label_test = self.data.iloc[idx, 1:5].values.astype('float32')
        
        return image, label_test

The data_transforms and params are as follows:

data_transforms = {
    'train': transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ]),
    'test': transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])
}

params = {
    'id_col':     'image_id',  
    'target_col': ['healthy', 'multiple_diseases', 'rust', 'scab'],
    'sufix':      '.jpg',
    'transform':  data_transforms['train']
}

train_dataset = CustomDataset(csv_file=data_dir+'train.csv', root_dir=data_dir+'images', **params)
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)

As you can see, the target consists of four categories into which each image is classified, like this:

[Image: pdFrame]

My issue is that when I do the backpropagation with the DataLoader, I get the wrong target size ([16]) instead of ([4]).

The training loop looks like this:


def train2(n_epochs, loaders, model, optimizer, criterion):
    """returns trained model"""
    # initialize tracker for minimum validation loss
    valid_loss_min = np.inf
    
    for epoch in range(1, n_epochs+1):
        # initialize variables to monitor training and validation loss
        train_loss = 0.0
        valid_loss = 0.0
        
        ###################
        # train the model #
        ###################
        model.train()
        for idx, (data, target) in enumerate(loaders):

            ## find the loss and update the model parameters accordingly
            ## record the average training loss, using something like
            ## train_loss = train_loss + ((1 / (batch_idx + 1)) * (loss.data - train_loss))
            optimizer.zero_grad()
            # forward pass: compute predicted outputs by passing inputs to the model
            output = model(data)
            print(idx)
            target.view(-1)
            print(target.shape)
            target = target.long()
            loss = criterion(output, target)
            # backward pass: compute gradient of the loss with respect to model parameters
            loss.backward()
            # perform a single optimization step (parameter update)
            optimizer.step()
            #update training loss
            train_loss += loss.item()*data.size(0)
            
        # calculate average losses
        train_loss = train_loss/len(loaders.sampler)
        # print training/validation statistics 
        print('Epoch: {} \tTraining Loss: {:.6f}'.format(
            epoch,
            train_loss,
            ))
            
    # return trained model
    return model

Here is the link to the GitHub repository where I track my progress, so you can have the full picture.

Any ideas, suggestions?

ptrblck · April 22, 2020, 7:28am

Which batch size are you using for the DataLoaders and could you please print the shape of target before the view(-1) operation?

Arnaud_Mal · April 22, 2020, 10:55am

The batch size is 4:

train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)

The shape of the Target is torch.Size([4, 4])

Originally I wanted to use something like:

label = pd.read_csv(csv_file).loc[idx, ['healthy', 'multiple_diseases', 'rust', 'scab']].values
label = torch.from_numpy(label.astype(np.int8))

But this doesn’t work either.

Eventually, I found another way to do it (though I am not sure whether it is right, as I still cannot get decent predictions on the test data): in the CustomDataset, I replaced the label-related operations with:

label = self.data.iloc[idx, 1:5].values.astype('int64')
label = np.argwhere(label ==1)
label = label.item(0)
label = torch.tensor(label)

The label shape is torch.Size([]), which gives me a target tensor of shape torch.Size([4]) from the DataLoader. This way the training works with a simple:

for idx, (data, target) in enumerate(loaders):

            ## find the loss and update the model parameters accordingly
            ## record the average training loss, using something like
            ## train_loss = train_loss + ((1 / (batch_idx + 1)) * (loss.data - train_loss))
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)

But I am not sure it is the proper way to do it, and I would like to understand why the previous solution didn’t work.

Thanks

ptrblck · April 22, 2020, 7:52pm

Thanks for the update!
The previous solution wasn’t working, since your target was one-hot encoded ([4, 4]) and you’ve flattened it to a tensor of [16], which created the shape mismatch.
nn.CrossEntropyLoss expects the target to contain the class indices, which can be created via torch.argmax(one_hot_target, dim=1). Your approach with numpy also seems to be correct.
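To make the shapes concrete, here is a small sketch with a batch of 4 samples and 4 classes. The one-hot values and the random logits are made up; only `torch.argmax(one_hot_target, dim=1)` comes from the explanation above.

```python
import torch
import torch.nn as nn

# A one-hot encoded target for a batch of 4 samples and 4 classes: shape [4, 4].
one_hot_target = torch.tensor(
    [
        [1, 0, 0, 0],
        [0, 0, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 1],
    ],
    dtype=torch.float32,
)

# Flattening keeps all 16 entries, which no longer matches the batch size of 4.
flattened = one_hot_target.view(-1)

# argmax over the class dimension yields one class index per sample: shape [4].
class_indices = torch.argmax(one_hot_target, dim=1)

# Stand-in model output (logits) of shape [batch_size, num_classes].
logits = torch.randn(4, 4)
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, class_indices)

print(flattened.shape, class_indices.shape, class_indices)
```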

saba · September 14, 2020, 12:38am

Hi Ptrblck,

What is the best way to save results (numbers, vectors, matrices) so that they keep the same type, and then reload them again?

ptrblck · September 14, 2020, 4:44am

Assuming this data is stored as PyTorch tensors, you can use torch.save and torch.load.
If that’s not the case, you could either use Python directly to store these values or e.g. numpy.
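A minimal sketch of both routes, assuming the results fit in memory. The dictionary keys, file names, and values are hypothetical; the point is that shape and dtype survive the round trip.

```python
import os
import tempfile

import numpy as np
import torch

# Hypothetical results: a scalar, a vector, and a matrix with different dtypes.
results = {
    "score": torch.tensor(0.5),
    "vector": torch.arange(3, dtype=torch.float32),
    "matrix": torch.eye(2, dtype=torch.float64),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    # PyTorch tensors: torch.save / torch.load keep shape and dtype.
    tensor_path = os.path.join(tmp_dir, "results.pt")
    torch.save(results, tensor_path)
    restored = torch.load(tensor_path)

    # NumPy arrays: np.save / np.load do the same for ndarrays.
    array_path = os.path.join(tmp_dir, "matrix.npy")
    np.save(array_path, results["matrix"].numpy())
    restored_array = np.load(array_path)

print(restored["matrix"].dtype, restored_array.dtype)
```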

saba · July 22, 2021, 4:20am

Hi Ptrblck,

I am loading the training data saved as .csv. It is not the first time that I have used torch.load(…csv) on the GPU. However, this time it gave me this error: zip archive (did you mean to use torch.jit.load()?)

In another script the same command works without any error, but in this one it gives me this error, and when I use the recommended load command it gives me another strange error.

I am wondering if you know what the best way of loading is for this problem?

Many thanks

ptrblck · July 22, 2021, 5:35am

I guess the error might be raised if you are mixing different PyTorch versions or torch.save with torch.jit.save. Could this be the case?

saba · July 22, 2021, 5:57am

I don’t think so, because the other code runs without error.

Can I save my tensor data using the following command, with no .csv at the end?

Path=root_diR1+'/'+'Itr='+str(ii)+'Gen='+str(GEN)+'Num='+str(Num)+'Coef='+str(CoeBL)

torch.save(TrainPatchGAN, Path, _use_new_zipfile_serialization=False)

ptrblck · July 22, 2021, 6:33am

Yes, you can store tensors using this command. If you are struggling with the csv file, feel free to post an executable code snippet to reproduce this issue.
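One thing that may help when debugging this: `torch.load` only reads files that were written by `torch.save`, whatever extension they carry, while a real comma-separated text file has to be parsed as text first. The sketch below contrasts the two routes. The file names and the data are placeholders, and it is not a diagnosis of the error reported above.

```python
import os
import tempfile

import numpy as np
import torch

data = torch.arange(6, dtype=torch.float64).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Binary route: the extension does not matter, torch.load reads what torch.save wrote.
    binary_path = os.path.join(tmp_dir, "Itr=0Gen=1")  # hypothetical file name
    torch.save(data, binary_path)
    from_binary = torch.load(binary_path)

    # Text route: a real comma-separated file is parsed as text, then converted.
    csv_path = os.path.join(tmp_dir, "data.csv")
    np.savetxt(csv_path, data.numpy(), delimiter=",")
    from_csv = torch.from_numpy(np.loadtxt(csv_path, delimiter=","))

print(from_binary.shape, from_csv.shape)
```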