Modifying a custom Dataset class to replace its datasets with new (normalised) datasets

morning all,

I have 2 custom datasets which are generated externally, both in the format [data1, data2, data3, data4], where data1 is an [n, m] array. I use the following class to combine the datasets into a single entity.

import torch
from torch.utils.data import Dataset, DataLoader

class DoubleDataset(Dataset):
    def __init__(self):
        # first pair of externally generated datasets, joined end to end
        data_a = torch.load('data1.pt')
        data_b = torch.load('data2.pt')
        self.data_1 = torch.utils.data.ConcatDataset((data_a, data_b))

        # second pair, joined the same way
        data_c = torch.load('data3.pt')
        data_d = torch.load('data4.pt')
        self.data_2 = torch.utils.data.ConcatDataset((data_c, data_d))

    def __getitem__(self, index):
        # return one sample from each concatenated dataset as a pair
        return self.data_1[index], self.data_2[index]

    def __len__(self):
        return min(len(self.data_1), len(self.data_2))
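
For reference, I load it roughly like this (the batch size and variable names here are just illustrative):

dataset = DoubleDataset()
loader = DataLoader(dataset, batch_size=25, num_workers=0, shuffle=True)

for data_1, data_2 in loader:
    # data_1 / data_2 are batches drawn from the two concatenated datasets
    pass  # feed the batch to the network here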



This works fine: I can load the data into my network, normalise by batch, and everything is hunky-dory.

I would like to try to normalise data_1 and data_2 across the whole dataset rather than per batch, but when I try I get the following error:

data_1.max()

AttributeError: 'ConcatDataset' object has no attribute 'max'

If I load the dataset into memory it is possible to use the following to normalise the whole dataset, but is it possible to get this into the class? If so, where does it go?

    batch_samples = data_1[0].size(0)
    data = data_1[0].view(batch_samples, data_1[0].size(1), -1)
    data2 = (data - data.min()) / (data.max() - data.min())  # scale to [0, 1]
    data3 = (data2 - 0.5) / 0.5                              # shift to [-1, 1]
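
To show what I mean by getting this into the class, something along these lines in __init__ is what I am picturing; the name DoubleDatasetNorm is made up, and I am assuming each item's first element is a tensor of the same size throughout, which may not be right:

class DoubleDatasetNorm(Dataset):
    def __init__(self):
        data_a = torch.load('data1.pt')
        data_b = torch.load('data2.pt')
        self.data_1 = torch.utils.data.ConcatDataset((data_a, data_b))

        # pre-compute min/max over the whole of data_1 once
        stacked = torch.stack([self.data_1[i][0] for i in range(len(self.data_1))])
        self.min_1 = stacked.min()
        self.max_1 = stacked.max()
        # data_2 would get the same treatment

    def __getitem__(self, index):
        item = list(self.data_1[index])  # copy so the stored data is not modified
        # scale to [0, 1] with the dataset-wide min/max, then to [-1, 1]
        scaled = (item[0] - self.min_1) / (self.max_1 - self.min_1)
        item[0] = (scaled - 0.5) / 0.5
        return item

    def __len__(self):
        return len(self.data_1)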

Failing this, if I do this:

def normalise_dataset(dataload):
    # load the entire dataset as one batch so min/max cover everything
    loader = DataLoader(dataload, batch_size=len(dataload), num_workers=0, shuffle=False)
    for (data_1, data_2) in loader:
        # normalise the first element of data_1 over the whole dataset
        batch_samples = data_1[0].size(0)
        data = data_1[0].view(batch_samples, data_1[0].size(1), -1)
        data2 = (data - data.min()) / (data.max() - data.min())
        data_1_O = (data2 - 0.5) / 0.5
        data_1[0] = data_1_O

        # normalise the first element of data_2 the same way
        batch_samples = data_2[0].size(0)
        dataI = data_2[0].view(batch_samples, data_2[0].size(1), -1)
        dataI2 = (dataI - dataI.min()) / (dataI.max() - dataI.min())
        data_2_O = (dataI2 - 0.5) / 0.5
        data_2[0] = data_2_O

        # attempt to recombine the two normalised tensors into one dataset
        norm = torch.utils.data.ConcatDataset((data_1, data_2))
        return norm

norm = normalise_dataset(dataload)


test=DataLoader(
    norm,
    batch_size=25,
    num_workers=0,
    shuffle=False
)

for i, (data_1, data_2) in enumerate(test):
    test1 = data_1[0]
    test2 = data_2[0]

This normalises the whole dataset; however, when I try to load test into the network I get:


RuntimeError: Expected object of scalar type Float but got scalar type Long for sequence element 1 in sequence argument at position #1 'tensors'

but the only elements that have changed are data_1[0] & data_2[0], and the formatting of these is the same as before.

How do I repack the original dataload with the new values of data_1 & data_2?
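
To be concrete, what I am hoping to end up with is something that still behaves like DoubleDataset, just wrapped around the normalised data; the name RepackedDataset and its arguments are made up:

class RepackedDataset(Dataset):
    def __init__(self, norm_1, norm_2):
        # norm_1 / norm_2 stand for the already-normalised versions of data_1 and data_2
        self.data_1 = norm_1
        self.data_2 = norm_2

    def __getitem__(self, index):
        # same pairing behaviour as DoubleDataset
        return self.data_1[index], self.data_2[index]

    def __len__(self):
        return min(len(self.data_1), len(self.data_2))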

Chaslie

PS, sorry for the bad coding…

I think you should sort that out first; I cannot tell from your post whether data_1 is an instantiated dataset or a dataloader.

The error message is very clear. I made a minimal demo. Before asking a question, try to build a small demo that can be run; many times the problem gets solved while you are coding it.

import torch 

class MyDataset(torch.utils.data.Dataset): 
    def __init__(self):
        data_x = torch.tensor([[1,2,3],[2,3,4]])
        data_y = torch.tensor([[3,4,5],[4,5,6]])
        self.data_concat = torch.utils.data.ConcatDataset((data_x,data_y))

    def __getitem__(self, index):
        return self.data_concat[index]

    def __len__(self):
        return len(self.data_concat)

dataset_temp = MyDataset()

dataset_loader = torch.utils.data.DataLoader(dataset_temp)

for data in dataset_loader:
    print(data.max())

hhaoao,

thanks for the reply. I think you may have misunderstood the problem, due to my bad explanation.

The problem I have is that when I take 2 datasets and combine them in the DoubleDataset class (which works fine), I cannot normalise the data across the whole dataset because of the concatenation.

To get over this I take the concatenated datasets and use the normalise_dataset function to normalise the whole dataset. This then gets recombined; however, at this stage, instead of a dataset with entries [[a,b,c,d],[a’,b’,c’,d’]], I end up with [a,b], where the dataloader has length 1 instead of the size of the dataset, and a and b each have the length of the dataset.

e.g. with a dataset of length 1000 and batch size 1:

What I want is a dataloader of length 1000, where dataset[0] is [[a,b,c,d],[a’,b’,c’,d’]] and each entry is a numerical 2D array plus some other metadata.

What I am ending up with is a dataloader of length 1, where dataset[0] is [[a],[a’]] and each entry is just the numerical 2D array.

To compound matters, datasets a and a’ before combining are of size [a,b,c,d] and [a’,b’,c’,d’], but only a and a’ are being combined.

I apologise if this is confusing; I am confused myself. The problem is something to do with how I am recreating the datasets, but the options with torch.utils.data.ConcatDataset are severely limited…

thanks for your patience.

chaslie

Oh my goodness, your problem is the same as one I ran into before. PyTorch cannot solve these data-processing problems for you; you just need to give up on using PyTorch to process the data. I used Python's zip() to output the two streams in parallel.
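
Roughly what I mean by using zip(), as a minimal sketch; the random tensors just stand in for your normalised data:

import torch

# stand-ins for your normalised data_1 and data_2
data_1 = [torch.randn(4, 5) for _ in range(1000)]
data_2 = [torch.randn(4, 5) for _ in range(1000)]

# zip keeps the two sequences aligned, so each step yields one matching pair
for sample_1, sample_2 in zip(data_1, data_2):
    pass  # feed the pair to the network here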

PS: If I have still misunderstood your question, I think it is all a data-processing problem and you will have to find a solution elsewhere. As I said in my link, you need to change the direction from which you are attacking the problem.