I am working with chemical spectra for a binary classification problem. The data are split across many .csv files, each containing several spectra, and every file contains the same number of spectra. For example, a single .csv file looks like this:
name id V1 V2 V3 ...
compound1 1 25 14 32 ...
compound2 2 21 10 12 ...
compound3 3 15 7 39 ...
The numerical portions of the file (V1 through VN) will be used in the neural network. All compounds in a given file share a single label, and the numerical portions of all files have the same dimensions. I have several thousand of these files. So, I'd like to create a dataloader that accepts a bunch of file paths, where each call to next(iter(customDataloader)) yields a data tensor of size (number_of_spectra_in_file, 1, number_of_columns_in_numerical_part) and a label tensor of size (number_of_spectra_in_file, 1). That is, each spectrum should be treated as a single sample, so if I make a dataloader with batch_size=16, the corresponding data tensor should have size (16*number_of_spectra_in_file, 1, number_of_columns_in_numerical_part).
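To make that concrete, here is a quick sketch of the shapes I am after, using placeholder numbers (7 spectra per file, 2020 numerical columns):

import torch

# Placeholder numbers: 7 spectra per file, 2020 numerical columns
spectra_per_file, n_cols = 7, 2020
batch_size = 16

# Desired batch shapes: every spectrum is its own sample
data = torch.zeros(batch_size * spectra_per_file, 1, n_cols)
labels = torch.zeros(batch_size * spectra_per_file, 1)
print(data.shape)    # torch.Size([112, 1, 2020])
print(labels.shape)  # torch.Size([112, 1])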
. Here’s my first attempt:
import pandas as pd
import torch

class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, filePathDf, transform=None, target_transform=None):
        self.transform = transform
        self.target_transform = target_transform
        self.filePathDf = filePathDf

    def __len__(self):
        # One item per file (each file holds several spectra)
        return len(self.filePathDf)

    def __getitem__(self, idx):
        # Load the file and keep only the numerical columns
        data = torch.tensor(
            pd.read_csv(self.filePathDf.loc[idx, 'path'])
            .drop(columns=['id', 'name'])
            .to_numpy(),
            dtype=torch.float,
        )
        # Every spectrum in the file carries the file's label
        label = [self.filePathDf.loc[idx, 'label']] * data.size(0)
        if self.transform:
            data = self.transform(data)
        if self.target_transform:
            label = self.target_transform(label)
        return data, label
When I run this, I get data that is not the right size.
>>> trainDataset = CustomDataset(trainFilesDf)
>>> trainDataloader = torch.utils.data.DataLoader(trainDataset, batch_size=10)
>>> trainData, trainLabels = next(iter(trainDataloader))
>>> print(trainData.size())
torch.Size([10, 7, 2020])
Here, there are 7 spectra per file, and the numerical portion of each spectrum is 2020 columns long. The variable trainLabels is a list of 7 tensors, each of size 10. The variable trainFilesDf is a pandas dataframe with two columns, path and label, where each file path has a corresponding label.
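For reference, trainFilesDf is built along these lines (the paths and labels shown are purely illustrative):

import pandas as pd

# Illustrative construction of the path/label dataframe;
# the real paths and labels come from my own file listing.
trainFilesDf = pd.DataFrame({
    'path': ['data/fileA.csv', 'data/fileB.csv', 'data/fileC.csv'],
    'label': [0, 1, 0],
})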
My question: should I just reshape the data and labels after the dataloader returns them, or is there a better way to formulate my CustomDataset class?
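For context, the reshape I have in mind would be something like this (assuming the shapes printed above, with 10 files of 7 spectra each):

# Flatten the file and spectrum dimensions into one sample dimension:
# (10, 7, 2020) -> (70, 1, 2020)
trainData = trainData.reshape(-1, 1, trainData.size(-1))

# trainLabels is a list of 7 tensors of size 10; stack to (10, 7) so the
# (file, spectrum) order matches the data, then flatten to (70, 1)
trainLabels = torch.stack(trainLabels, dim=1).reshape(-1, 1)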