I am working with chemical spectra for a binary classification problem. The data are split across many .csv files, each containing several spectra, and every file contains the same number of spectra. For example, a single .csv file looks like this:
name id V1 V2 V3 ...
compound1 1 25 14 32 ...
compound2 2 21 10 12 ...
compound3 3 15 7 39 ...
The numerical portions of the file (V1 through VN) will be used in the neural network. All compounds in a given file share a single label, and the numerical portions of all files have the same dimensions. I have several thousand of these files. So, I'd like to create a dataloader that accepts a bunch of file paths, where each call to next(iter(customDataloader)) yields a data tensor of size (number_of_spectra_in_file, 1, number_of_columns_in_numerical_part) and a label tensor of size (number_of_spectra_in_file, 1). That is, each spectrum should be treated as a single sample, so if I make a dataloader with batch_size=16, the corresponding data tensor should have size (16*number_of_spectra_in_file, 1, number_of_columns_in_numerical_part).
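To make that concrete, here is a quick sketch of the shapes I am after, using placeholder numbers (7 spectra per file, 2020 numerical columns):

import torch

# Placeholder numbers: 7 spectra per file, 2020 numerical columns
spectra_per_file, n_cols = 7, 2020
batch_size = 16

# Desired batch shapes: every spectrum is its own sample
data = torch.zeros(batch_size * spectra_per_file, 1, n_cols)
labels = torch.zeros(batch_size * spectra_per_file, 1)
print(data.shape)    # torch.Size([112, 1, 2020])
print(labels.shape)  # torch.Size([112, 1])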
. Here’s my first attempt:
import pandas as pd
import torch

class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, filePathDf, transform=None, target_transform=None):
        self.transform = transform
        self.target_transform = target_transform
        self.filePathDf = filePathDf

    def __len__(self):
        # One item per file (each file holds several spectra)
        return len(self.filePathDf)

    def __getitem__(self, idx):
        # Load the file and keep only the numerical columns
        data = torch.tensor(
            pd.read_csv(self.filePathDf.loc[idx, 'path'])
            .drop(columns=['id', 'name'])
            .to_numpy(),
            dtype=torch.float,
        )
        # Every spectrum in the file carries the file's label
        label = [self.filePathDf.loc[idx, 'label']] * data.size(0)
        if self.transform:
            data = self.transform(data)
        if self.target_transform:
            label = self.target_transform(label)
        return data, label
When I run this, I get data that is not the right size.
>>> trainDataset = CustomDataset(trainFilesDf)
>>> trainDataloader = torch.utils.data.DataLoader(trainDataset, batch_size=10)
>>> trainData, trainLabels = next(iter(trainDataloader))
>>> print(trainData.size())
torch.Size([10, 7, 2020])
Here, there are 7 spectra per file, and the numerical portion of each spectrum is 2020 columns long. The variable trainLabels is a list of 7 tensors, each of size 10. The variable trainFilesDf is a pandas dataframe with two columns, path and label, where each file path has a corresponding label.
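For reference, trainFilesDf is built along these lines (the paths and labels shown are purely illustrative):

import pandas as pd

# Illustrative construction of the path/label dataframe;
# the real paths and labels come from my own file listing.
trainFilesDf = pd.DataFrame({
    'path': ['data/fileA.csv', 'data/fileB.csv', 'data/fileC.csv'],
    'label': [0, 1, 0],
})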
My question: should I just reshape the data and labels after the dataloader returns them, or is there a better way to formulate my CustomDataset class?
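For context, the reshape I have in mind would be something like this (assuming the shapes printed above, with 10 files of 7 spectra each):

# Flatten the file and spectrum dimensions into one sample dimension:
# (10, 7, 2020) -> (70, 1, 2020)
trainData = trainData.reshape(-1, 1, trainData.size(-1))

# trainLabels is a list of 7 tensors of size 10; stack to (10, 7) so the
# (file, spectrum) order matches the data, then flatten to (70, 1)
trainLabels = torch.stack(trainLabels, dim=1).reshape(-1, 1)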