Why do all the tutorials use a pandas DataFrame to store their input data, only converting it to a tensor or NumPy array in the __getitem__ function?
For both my training and test sets, each with 500,000 examples of 67 scalar features, converting the whole object to a tensor up front seems to use less memory and to be much quicker.
Here are my two versions of the custom Dataset object; the first follows the tutorials. A single epoch takes 53 seconds, and the program uses 2.6 GB of RAM in total while training.
import os
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset

class EventsDataset(Dataset):
    def __init__(self, root_folder, data_set_name, numrows=None, mns_file_name=None):
        self.my_file_name = os.path.join(root_folder, data_set_name)
        # Keep the whole CSV in memory as a pandas DataFrame.
        self.file_data = pd.read_csv(self.my_file_name, nrows=numrows)

    def __len__(self):
        return len(self.file_data)

    def __getitem__(self, idx):
        # Slice the DataFrame and build fresh tensors on every access.
        truth_data = torch.from_numpy(self.file_data.iloc[idx, 1:3].values.astype(np.float32))
        recon_data = torch.from_numpy(self.file_data.iloc[idx, 4:].values.astype(np.float32))
        return recon_data, truth_data
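I suspect the cost is the per-row work in __getitem__: every access slices the DataFrame with .iloc, builds a fresh NumPy array, and wraps it in a new tensor. Here is a quick micro-benchmark sketch of just that per-item cost (on hypothetical random data with the same shape as mine, not my actual files):

import timeit
import numpy as np
import pandas as pd
import torch

# Hypothetical stand-in for my data: 500 000 rows, 67 float32 columns.
df = pd.DataFrame(np.random.rand(500_000, 67).astype(np.float32))
tensor = torch.from_numpy(df.values.astype(np.float32))

# Per-item cost of the DataFrame route used in __getitem__ above.
print(timeit.timeit(
    lambda: torch.from_numpy(df.iloc[42, 4:].values.astype(np.float32)),
    number=10_000))

# Per-item cost of plain tensor slicing, which only creates a view.
print(timeit.timeit(lambda: tensor[42, 4:], number=10_000))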
The following custom Dataset instead uses only 2.3 GB of RAM while training, and each epoch takes only 7 seconds!
class EventsDataset(Dataset):
    def __init__(self, root_folder, data_set_name, numrows=None, mns_file_name=None):
        self.my_file_name = os.path.join(root_folder, data_set_name)
        # Convert the whole CSV to a single float32 tensor up front.
        file_data = pd.read_csv(self.my_file_name, nrows=numrows)
        self.tensor_data = torch.from_numpy(file_data.values.astype(np.float32))
        del file_data

    def __len__(self):
        return len(self.tensor_data)

    def __getitem__(self, idx):
        # Plain tensor indexing returns views; no per-item copy or conversion.
        truth_data = self.tensor_data[idx, 1:3]
        recon_data = self.tensor_data[idx, 4:]
        return recon_data, truth_data
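For reference, both versions are consumed the same way; a minimal sketch of the loop (the path and batch size here are placeholders, not my actual configuration):

from torch.utils.data import DataLoader

# Placeholder path and batch size; only the Dataset class changes between runs.
dataset = EventsDataset(root_folder="data", data_set_name="train.csv")
loader = DataLoader(dataset, batch_size=64, shuffle=True)

for recon_data, truth_data in loader:
    pass  # forward/backward pass goes here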
So what's going on here? Is my example a special case? Why do we want to use the pandas DataFrame, and why do we hold off on converting it to a tensor until the __getitem__ call?