I am working on a binary classification problem with chemical spectra. The data is split across many .csv files, each containing the same number of spectra. For example, a single .csv file looks like this:

```
name id V1 V2 V3 ...
compound1 1 25 14 32 ...
compound2 2 21 10 12 ...
compound3 3 15 7 39 ...
```
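For reference, extracting the numerical portion of one of these files looks like this (the inline CSV here is just a stand-in for one of my actual files):

```python
import io
import pandas as pd

# A minimal stand-in for one spectra file (same layout as above).
csv_text = """name,id,V1,V2,V3
compound1,1,25,14,32
compound2,2,21,10,12
compound3,3,15,7,39
"""

df = pd.read_csv(io.StringIO(csv_text))
# Keep only the numerical columns V1..VN for the network.
numeric = df.drop(columns=["name", "id"]).to_numpy()
print(numeric.shape)  # (3, 3): (spectra_per_file, numeric_columns)
```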

The numerical portions of the file (`V1` through `VN`) will be used in the neural network. All compounds within a file share a single label, and the numerical portions of all files have the same shape. I have several thousand of these files. So, I'd like to create a dataloader that accepts a bunch of file paths, where each call to `next(iter(customDataloader))` yields a data tensor of size `(number_of_spectra_in_file, 1, number_of_columns_in_numerical_part)` and a label tensor of size `(number_of_spectra_in_file, 1)`. That is, each spectrum should be treated as a single sample, so if I make a dataloader with `batch_size = 16`, the corresponding data tensor should have size `(16*number_of_spectra_in_file, 1, number_of_columns_in_numerical_part)`. Here's my first attempt:

```
import pandas as pd
import torch


class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, filePathDf, transform=None, target_transform=None):
        self.transform = transform
        self.target_transform = target_transform
        self.filePathDf = filePathDf

    def __len__(self):
        return len(self.filePathDf)

    def __getitem__(self, idx):
        data = torch.tensor(
            pd.read_csv(self.filePathDf.loc[idx, 'path'])
            .drop(columns=['name', 'id'])
            .to_numpy(),
            dtype=torch.float,
        )
        # Every spectrum in a file shares that file's label.
        label = [self.filePathDf.loc[idx, 'label']] * data.size(0)
        if self.transform:
            data = self.transform(data)
        if self.target_transform:
            label = self.target_transform(label)
        return data, label
```

When I run this, I get data that is not the right size.

```
>>> trainDataset = CustomDataset(trainFilesDf)
>>> trainDataloader = torch.utils.data.DataLoader(trainDataset, batch_size=10)
>>> trainData, trainLabels = next(iter(trainDataloader))
>>> print(trainData.size())
torch.Size([10, 7, 2020])
```

Here, there are 7 spectra per file, and each spectrum's numerical portion is 2020 values long. The variable `trainLabels` is a list of 7 tensors, each of size 10. The variable `trainFilesDf` is a pandas dataframe with 2 columns, `path` and `label`, where each file path has a corresponding label.

My question: Should I just reshape the data and labels, or is there a better way to formulate my `CustomDataset` class?
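For concreteness, the reshape option I have in mind would collapse the batch and per-file dimensions into one (sizes below are hardcoded from the example output above; the zero tensors just simulate a batch):

```python
import torch

# Simulated batch as currently returned: (batch_size, spectra_per_file, n_cols).
batch = torch.zeros(10, 7, 2020)

# Flatten the batch and per-file dims, then insert a channel dim of 1.
flat = batch.reshape(-1, 1, batch.size(-1))
print(flat.shape)  # torch.Size([70, 1, 2020])

# The labels come back as a list of 7 tensors of size 10; stacking along
# dim=1 gives (batch_size, spectra_per_file), matching the data ordering
# before flattening.
label_list = [torch.zeros(10, dtype=torch.long) for _ in range(7)]
labels = torch.stack(label_list, dim=1).reshape(-1, 1)
print(labels.shape)  # torch.Size([70, 1])
```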