Why do we wait to convert the pandas DataFrame?

Why do all the tutorials store their input data in a pandas DataFrame, only converting it to a tensor or NumPy array in the __getitem__ method?

For both my training and test sets, each with 500,000 examples of 67 scalar features, it seems that converting the whole object to a tensor up front uses less memory and is much quicker.

Here are my two versions of the custom Dataset object; the first follows the tutorials. A single epoch takes 53 seconds, and the program uses 2.6 GB of RAM in total while training.

import os

import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset


class EventsDataset(Dataset):
    def __init__(self, root_folder, data_set_name, numrows=None, mns_file_name=None):
        self.my_file_name = os.path.join(root_folder, data_set_name)
        # Keep the whole CSV as a DataFrame; rows are converted lazily.
        self.file_data = pd.read_csv(self.my_file_name, nrows=numrows)

    def __len__(self):
        return len(self.file_data)

    def __getitem__(self, idx):
        # Convert one row at a time: slice, cast to float32, wrap as a tensor.
        truth_data = torch.from_numpy(self.file_data.iloc[idx, 1:3].values.astype(np.float32))
        recon_data = torch.from_numpy(self.file_data.iloc[idx, 4:].values.astype(np.float32))
        return recon_data, truth_data

The following custom Dataset uses only 2.3 GB of RAM while training, and each epoch drops to just 7 seconds!

import os

import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset


class EventsDataset(Dataset):
    def __init__(self, root_folder, data_set_name, numrows=None, mns_file_name=None):
        self.my_file_name = os.path.join(root_folder, data_set_name)
        # Convert the whole table to a float32 tensor up front.
        file_data = pd.read_csv(self.my_file_name, nrows=numrows)
        self.tensor_data = torch.from_numpy(file_data.values.astype(np.float32))
        del file_data  # drop the DataFrame reference (it would go out of scope anyway)

    def __len__(self):
        return len(self.tensor_data)

    def __getitem__(self, idx):
        # Indexing a tensor returns a cheap view, with no per-row conversion.
        truth_data = self.tensor_data[idx, 1:3]
        recon_data = self.tensor_data[idx, 4:]
        return recon_data, truth_data

So what's going on here? Is my example a special case? Why do we want to keep the pandas DataFrame, and why do we hold off on converting it to a tensor until the __getitem__ call?


Usually you would like to avoid long startup times and thus push the data loading and processing to the __getitem__ method. If you are using a DataLoader, multiple workers can load the data in the background while your GPU is busy training, which might hide the loading time.
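As a minimal sketch of that pattern (the paths and batch size here are placeholders), you could wrap the Dataset in a DataLoader with num_workers > 0, so worker processes call __getitem__ in the background:

from torch.utils.data import DataLoader

# Placeholder paths; EventsDataset is the first version above.
dataset = EventsDataset("data/", "train.csv")

# With num_workers > 0, worker processes load samples in the
# background, which can overlap data loading with GPU compute.
loader = DataLoader(dataset, batch_size=256, shuffle=True, num_workers=4)

for recon_data, truth_data in loader:
    pass  # training step would go here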

Currently you are timing only a single epoch, which also leaves out the initialization of your Dataset.
However, if you just would like to slice the pd.DataFrame without any fancy indexing, it might indeed be faster to transform it to a tensor beforehand.
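As a rough sketch of a fairer comparison (the file names are placeholders), you could time the construction and one full pass separately:

import time

start = time.perf_counter()
dataset = EventsDataset("data/", "train.csv")  # placeholder paths
init_time = time.perf_counter() - start

start = time.perf_counter()
for idx in range(len(dataset)):
    recon_data, truth_data = dataset[idx]  # one full pass over the data
epoch_time = time.perf_counter() - start

print(f"init: {init_time:.1f}s, one pass: {epoch_time:.1f}s")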


Thanks!

It does seem faster this way since my datasets are really simple: other than shuffling, there is no augmentation. So plain tensors seem to be the way to go for my problem.
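For what it's worth, the same idea can be sketched with the built-in torch.utils.data.TensorDataset instead of a custom class (the path is a placeholder, and the column ranges mirror the ones above):

import numpy as np
import pandas as pd
import torch
from torch.utils.data import TensorDataset

# Placeholder path; load everything and convert to float32 once, as above.
data = torch.from_numpy(pd.read_csv("data/train.csv").values.astype(np.float32))

# TensorDataset indexes the stored tensors directly in __getitem__.
dataset = TensorDataset(data[:, 4:], data[:, 1:3])  # (recon, truth)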