I am working with multiple chemical spectra for a binary classification problem and I have multiple files with multiple spectra in each file. That is, a single .csv file contains many spectra, with each file containing the same number of spectra. For example, a single .csv file looks like this:
```
name       id  V1  V2  V3  ...
compound1   1  25  14  32  ...
compound2   2  21  10  12  ...
compound3   3  15   7  39  ...
```
The numerical portions of the file (the `V1` … `VN` columns) will be used in the neural network. All compounds within a single file share one label, and the numerical portions of all files have the same shape. I have several thousand of these files. So, I'd like to create a `Dataset` that accepts a bunch of file paths and wrap it in a `DataLoader`, where each call to `next(iter(customDataloader))` yields a data tensor of size `(number_of_spectra_in_file, 1, number_of_columns_in_numerical_part)` and a label tensor of size `(number_of_spectra_in_file, 1)`. That is, each spectrum should be considered a single sample, so if I make a data loader with `batch_size = 16`, the corresponding data tensor should have size `(16 * number_of_spectra_in_file, 1, number_of_columns_in_numerical_part)`. Here's my first attempt:
```python
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, filePathDf, transform=None, target_transform=None):
        self.transform = transform
        self.target_transform = target_transform
        self.filePathDf = filePathDf

    def __len__(self):
        return len(self.filePathDf)

    def __getitem__(self, idx):
        data = torch.tensor(
            pd.read_csv(self.filePathDf.loc[idx, 'path'])
              .drop(columns=['id', 'name'])
              .to_numpy(),
            dtype=torch.float
        )
        # one copy of the file's label per spectrum in the file
        label = [self.filePathDf.loc[idx, 'label']] * data.size(0)
        if self.transform:
            data = self.transform(data)
        if self.target_transform:
            label = self.target_transform(label)
        return data, label
```
When I run this, I get data that is not the right size.
```
>> trainDataset = CustomDataset(trainFilesDf)
>> trainDataloader = torch.utils.data.DataLoader(trainDataset, batch_size=10)
>> trainData, trainLabels = next(iter(trainDataloader))
>> print(trainData.size())
torch.Size([10, 7, 2020])
```
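If it helps, the reshape I have in mind would look like this (a dummy tensor stands in for the batch above; the sizes match my data):

```python
import torch

# Dummy batch shaped like the DataLoader output above:
# (batch_size, spectra_per_file, num_points)
trainData = torch.randn(10, 7, 2020)

# Merge the file and spectrum dimensions into one sample dimension
# and add the singleton channel dimension I want.
flat = trainData.reshape(-1, 1, trainData.size(-1))
print(flat.shape)  # torch.Size([70, 1, 2020])
```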
Here, there are 7 spectra per file, and the numerical portion of each spectrum is 2020 points long. The variable `trainLabels` is a list of 7 tensors, each of size 10. The variable `trainFilesDf` is a pandas dataframe with 2 columns, `path` and `label`, where each file path has a corresponding label.
My question: Should I just reshape the data and labels, or is there a better way to formulate my `Dataset` and `DataLoader`?
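For reference, the alternative I've been considering is a custom `collate_fn` that concatenates the per-file tensors along the sample dimension instead of stacking them. This is only a sketch with made-up names (`flatten_collate`), and it assumes `__getitem__` is changed to return the label as a 1-D tensor rather than a list:

```python
import torch

def flatten_collate(batch):
    """Concatenate per-file samples along dim 0 instead of stacking.

    Assumes each dataset item is a (data, label) pair with
    data:  (spectra_per_file, num_points)
    label: (spectra_per_file,)
    """
    datas, labels = zip(*batch)
    data = torch.cat(datas, dim=0).unsqueeze(1)    # (N_total, 1, num_points)
    label = torch.cat(labels, dim=0).unsqueeze(1)  # (N_total, 1)
    return data, label

# Dummy batch of 3 "files", 7 spectra each, 2020 points per spectrum:
batch = [(torch.randn(7, 2020), torch.full((7,), i, dtype=torch.float))
         for i in range(3)]
data, label = flatten_collate(batch)
print(data.shape, label.shape)  # torch.Size([21, 1, 2020]) torch.Size([21, 1])
```

This would then be passed to the loader as `DataLoader(trainDataset, batch_size=16, collate_fn=flatten_collate)`, but I'm not sure whether that is preferable to reshaping after the fact.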