Dataloader manipulation

JaeGer · June 19, 2018, 12:29pm

Hello,
I'm trying to use the DataLoader but can't figure out how it works. I split my dataset into 75% train and 25% test and use it as shown in the code below. My question: how does the DataLoader identify the label (objective, class)? Is it the last column of the tensor by default, or do I have to specify it?
P.S.: I'm new to PyTorch and DL. Is it good to use shuffle for the training set?

train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)

inkplay · June 19, 2018, 12:46pm

I don't know whether you implemented train_dataset yourself, but if you did, you should notice that the dataset class must define two special methods, __len__ and __getitem__, so that len(dataset) and dataset[idx] work. The DataLoader uses both to navigate the dataset and fetch samples the way you want. I am experimenting with PyTorch as well, so feel free to correct me if I am wrong. As for shuffling the training set, it depends on what you want to do. E.g. if you train on sequential data where the order of the samples matters, you don't want it on; if you are training a classification network, you should turn it on.
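A minimal sketch of the two methods mentioned above, assuming a made-up in-memory dataset (ToyDataset and its contents are hypothetical, not the asker's data). It also shows that the label is simply whatever __getitem__ returns alongside the features, and that shuffle only changes the order in which whole samples are drawn:

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical dataset: 8 samples with 3 features each, one label per sample."""

    def __init__(self):
        self.features = torch.arange(24, dtype=torch.float32).reshape(8, 3)
        self.labels = torch.arange(8)  # label i belongs to sample i

    def __len__(self):
        # Lets the DataLoader know how many samples exist.
        return len(self.features)

    def __getitem__(self, idx):
        # The DataLoader does not guess which column is the label:
        # it batches whatever this method returns.
        return self.features[idx], self.labels[idx]


dataset = ToyDataset()
ordered = DataLoader(dataset, batch_size=4, shuffle=False)
shuffled = DataLoader(dataset, batch_size=4, shuffle=True)

# Collect the labels in the order each loader yields them.
ordered_labels = torch.cat([y for _, y in ordered])
shuffled_labels = torch.cat([y for _, y in shuffled])
print(ordered_labels.tolist())
print(shuffled_labels.tolist())
```

With shuffle=True the same samples come back in a different order on each pass, but each feature row stays paired with its own label, and the contents of a single sample are never reordered.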

JaeGer · June 19, 2018, 1:00pm

Hello @inkplay, thank you for replying.
To describe my dataset so you understand it better: I have [examples, dates, features] (e.g. [1920, 23, 20]), where the examples are agricultural lands and the 20 features are the mean and variance for different bandwidths (10 to be exact, so 2 * 10). These measurements are taken on different dates throughout the year, and my targets (labels, classes) are [examples, dates, class] (e.g. [1920, 23, 1]), with 12 classes in total. After cleaning, splitting and scaling (normalizing) my data, I want to train an RNN on it for classification (simple tanh cells; like you I'm experimenting, and I'll try GRU and LSTM after that). As far as I can see from different code online, everyone seems to use the DataLoader. Can it be replaced, or how can I tell it that my class is this row in the tensor?
Thank you.

JuanFMontesinos · June 19, 2018, 1:59pm

You have to define a dataset:
https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset
A Dataset is an indexable object (it implements __len__ and __getitem__) designed to work with the DataLoader.

You have to set up your dataset:

import os

import numpy as np
import torch
from PIL import Image

# Note: data2dic is not defined in this post; it is presumably the author's own
# helper that turns the loaded array into a Python dictionary.


class BinaryData(torch.utils.data.Dataset):

    def __init__(self, root_dir, transform):
        # Define the basics of your dataset here, such as its directory.
        # I list all the files inside my dataset and use that list to read data later.
        self.input_list = []
        self.transform = transform
        for path, subdirs, files in os.walk(root_dir):
            for name in files:
                self.input_list.append(os.path.join(path, name))

    def __len__(self):
        return len(self.input_list)

    def __getitem__(self, idx):
        # Load your data here however you want; below is the last one I wrote, as an example.
        # I was loading Python dictionaries. You can return the data with any structure you want.
        dic = data2dic(np.load(self.input_list[idx]))
        audio = dic['audio']
        if audio.shape != (256, 256):
            audio = np.resize(audio, (256, 256))
        frames = dic['frames']
        size = np.shape(frames)
        images = []
        # Apply the transform to each frame along the last axis, then stack them.
        for i in range(size[3]):
            images.append(self.transform(Image.fromarray(frames[:, :, :, i])))
        frames = torch.stack(images)
        return audio, frames

Regardless of how your data is stored, you can manage and modify it inside __getitem__. The DataLoader wraps __getitem__, so each time you iterate over the DataLoader it yields batches with the same structure as __getitem__'s return value.
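Applied to the tensors described earlier in the thread, this could look roughly like the sketch below. The class name FieldSequences is hypothetical, the data is random stand-in data, and only 40 examples are used instead of 1920; the shapes [examples, dates, features] and [examples, dates, 1] and the 12 classes are taken from the question. The point is that the label is the second element returned by __getitem__, not a column the DataLoader picks:

```python
import torch
from torch.utils.data import Dataset, DataLoader


class FieldSequences(Dataset):
    """Hypothetical name. One item is one field: (dates, features) inputs, (dates,) labels."""

    def __init__(self, features, labels):
        # features: [examples, dates, features]; labels: [examples, dates, 1]
        assert features.shape[:2] == labels.shape[:2]
        self.features = features.float()
        # Drop the trailing singleton axis and store class indices as integers.
        self.labels = labels.squeeze(-1).long()

    def __len__(self):
        return self.features.shape[0]

    def __getitem__(self, idx):
        # First element is the input sequence, second element is its labels.
        return self.features[idx], self.labels[idx]


# Random stand-in data with the shapes from the question, but fewer examples.
x = torch.randn(40, 23, 20)
y = torch.randint(0, 12, (40, 23, 1))

loader = DataLoader(FieldSequences(x, y), batch_size=8, shuffle=True)
batch_x, batch_y = next(iter(loader))
print(batch_x.shape, batch_y.shape)
```

For tensors that already sit in memory like this, torch.utils.data.TensorDataset(x, y) gives the same (input, label) pairing without a custom class; the custom class is mainly useful when the labels need reshaping or the data has to be loaded lazily.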

inkplay · June 19, 2018, 4:42pm

Follow this tutorial https://pytorch.org/tutorials/beginner/data_loading_tutorial.html to create your dataset class; basically, just swap the image dataset they use for your CSV or text dataset file.
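As a rough sketch of what swapping in a CSV file might look like (this is not the tutorial's code): the class name CsvDataset, the column names and the inline sample data are all assumptions made for illustration, and the label column is passed in explicitly rather than assumed to be the last one.

```python
import csv
import io

import torch
from torch.utils.data import Dataset, DataLoader


class CsvDataset(Dataset):
    """Hypothetical CSV-backed dataset; every column except label_column is a feature."""

    def __init__(self, file_obj, label_column):
        reader = csv.DictReader(file_obj)
        rows = list(reader)
        feature_names = [n for n in reader.fieldnames if n != label_column]
        self.features = torch.tensor(
            [[float(row[n]) for n in feature_names] for row in rows]
        )
        self.labels = torch.tensor([int(row[label_column]) for row in rows])

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]


# Inline stand-in for a real file; with a file on disk, pass open(path, newline="").
sample_csv = io.StringIO(
    "mean_b1,var_b1,label\n"
    "0.1,0.2,3\n"
    "0.4,0.5,7\n"
    "0.7,0.8,3\n"
)
csv_dataset = CsvDataset(sample_csv, label_column="label")
csv_loader = DataLoader(csv_dataset, batch_size=2, shuffle=False)
for features, labels in csv_loader:
    print(features.shape, labels.tolist())
```

This version reads the whole file into memory in __init__, which is fine for small tabular data; for large files, reading rows lazily inside __getitem__ is the usual alternative.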