PyTorch custom data loader

Hello everyone, I have just started learning PyTorch and I ran into a problem while trying to create a DataLoader from my custom dataset, which consists of 20 files "data_x.mat" stored in a specific folder. I need to use them with a DataLoader. Can anyone help me write this class to get iterable batches?
My data is:

  • 10 files, each a dict {"Training_Patches": shape (760, 120, 21, 21) / "Label": shape (1, 760)}, stored as a .mat file
  • 10 files, each a dict {"Testing_Patches": shape (420, 120, 21, 21) / "Label": shape (1, 420)}, stored as a .mat file

Here is what I think it should look like. Any suggestions, ideas, or help would be appreciated:
class Datasets(Dataset):
    def __init__(self):
        self.tensors = []
        self.labels = []
        for i in range(len(data_i["Training_Patches"])):
            self.tensors.append(data_i["Training_Patches"])
            self.labels.append(data_i["Label"])

    def __getitem__(self, index):
        # return one item at the index
        return

    def __len__(self):
        # return the data length
        return

Would you like to load each file as one sample or is a whole batch saved in each file?
Were you able to load the data using scipy.io.loadmat?
Here is a dummy example for your dataset:


import torch
from scipy import io
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, mat_paths):
        self.paths = mat_paths

    def __getitem__(self, index):
        # Load one .mat file per sample
        data = io.loadmat(self.paths[index])
        x = torch.from_numpy(data['Training_Patches'])
        y = torch.from_numpy(data['Label'])
        return x, y

    def __len__(self):
        return len(self.paths)

Thanks ptrblck for replying. What I need is to load the whole set of files in the directory as one dataset to use with torch.utils.data.DataLoader, as follows:

Train_dst = MyDataset(path=path, train=True)
Test_dst = MyDataset(path=path, train=False)

TrainLoader = torch.utils.data.DataLoader(dataset=Train_dst, batch_size=50, shuffle=True)
TestLoader = torch.utils.data.DataLoader(dataset=Test_dst, batch_size=50, shuffle=True)

I’m not sure I understand your use case properly.
If each file should be a sample, you could only use a batch size of at most 10, as you only have 10 files.
Did my code sample work for you?
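If each patch should be a sample instead (so that batch_size=50 works), a minimal sketch would be to load and concatenate all files up front. The class name, the train flag, and the assumption that all 20 .mat files live in one folder are illustrative, based on the description above:

import os
import glob
import numpy as np
import torch
from scipy import io
from torch.utils.data import Dataset

class PatchDataset(Dataset):
    def __init__(self, path, train=True):
        key = 'Training_Patches' if train else 'Testing_Patches'
        xs, ys = [], []
        for mat_file in sorted(glob.glob(os.path.join(path, '*.mat'))):
            data = io.loadmat(mat_file)
            if key not in data:
                continue  # skip files belonging to the other split
            xs.append(data[key])                  # e.g. (760, 120, 21, 21) per file
            ys.append(data['Label'].reshape(-1))  # (1, 760) -> (760,)
        self.x = torch.from_numpy(np.concatenate(xs, axis=0))
        self.y = torch.from_numpy(np.concatenate(ys, axis=0))

    def __getitem__(self, index):
        return self.x[index], self.y[index]

    def __len__(self):
        return self.x.size(0)

With this, Train_dst = PatchDataset(path, train=True) would expose 7600 individual patches, so a DataLoader with batch_size=50 works as intended.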


I have an Image.mat file with 2656 image samples.
It has 4 keys; the "img" key contains the data with size (2656, 4097), where the last column is the class label.
How can I build a dataset based on this .mat file?

I assume the data fits completely into memory?
If so, you could load the .mat file using scipy in your Dataset's __init__ method, get a single sample in __getitem__ by indexing, and return the tensor via torch.from_numpy.
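For example, a minimal sketch under these assumptions (key 'img', last column holding the label, class name illustrative):

import torch
from scipy import io
from torch.utils.data import Dataset

class ImgMatDataset(Dataset):
    def __init__(self, mat_path):
        # load once; shape (2656, 4097), last column is the class label
        arr = io.loadmat(mat_path)['img']
        self.x = torch.from_numpy(arr[:, :-1]).float()
        self.y = torch.from_numpy(arr[:, -1]).long()

    def __getitem__(self, index):
        return self.x[index], self.y[index]

    def __len__(self):
        return self.x.size(0)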

Hi ptrblck,

I wrote my custom data loader based on your guidance. The code is correct and works. The problem is that changing the batch size has no effect: it always gives me batches of 64, regardless of the batch_size argument. Would you please help me with that?

workers = 2
Batchsize = 128

datarootNorm = '//homeHealthytorchpatch_v1/'

datasetNorm = DatasetLoad(datarootNorm, transforms=transform, debug=False, ii=1)

dataloaderNormal = torch.utils.data.DataLoader(datasetNorm, batch_size=Batchsize, shuffle=True, num_workers=workers)


class DatasetLoad():

    def __init__(self, root_dirTrain, transforms, debug, ii):
        self.patches, self.labels = None, None
        self.ii = ii
        patch_path = os.path.join(root_dirTrain, 'Healthyzone' + str(self.ii) + ".csv")
        # 15000 x 1 x 64 x 64
        self.patches = torch.load(patch_path)
        # self.patches = scipy.io.loadmat(patch_path)['ImagesNormal']
        # patch_path = os.path.join(root_dirTrain, "NegTraining11_v1" + ".mat")
        # self.patches = scipy.io.loadmat(patch_path)['NegTraining11_v1']
        self.transforms = transforms
        self.debug = debug

    def __getitem__(self, index):
        patchF = self.patches[index, :, :, :]
        if self.transforms is not None:
            patchF = self.transforms(patchF)
        return patchF

    def __len__(self):
        return self.patches.shape[-1]

What shape does self.patches have? Is it [15000, 1, 64, 64]?
If so, I assume dim0==15000 are the number of samples?
If that’s also correct, I think you are returning the wrong __len__, as self.patches.shape[-1] would return 64 instead of 15000, so your max. batch size would be 64 and you would only load patches[:64].
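That is, assuming dim0 holds the samples, the fix would be:

    def __len__(self):
        # dim0 holds the number of samples, not the last dimension
        return self.patches.shape[0]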


Many thanks, let me check.

Yes, you are right. I corrected it. I really appreciate it.

Hi ptrblck,

I am using the Spyder editor.
My code was working, but now when I use the DataLoader and try to get data from it, it gives me this error:

[SpyderKernelApp] WARNING | WARNING: attempted to send message from fork
{'header': {'version': '5.3', 'date': datetime.datetime(2021, 3, 29, 6, 7, 4, 493062, tzinfo=datetime.timezone.utc), 'session': '5b99e49d-1c858153ee7c8a8ade190ea5', 'username': 'mom008', 'msg_type': 'comm_msg', 'msg_id': '5b99e49d-1c858153ee7c8a8ade190ea5_84'}, 'msg_id': '5b99e49d-1c858153ee7c8a8ade190ea5_84', 'msg_type': 'comm_msg', 'parent_header':

The code is:

for ii in range(1, 7, 1):

    datarootAbNorm = '//home/mom008//work/RoiClassification/radius_32_32/selectedROIs/zone_' + str(ii) + '/'

    datasetAbNorm = CustomDataSetZoneAbnormal(datarootAbNorm, transforms=transform, zone=ii)

    dataloaderAbnormal = torch.utils.data.DataLoader(datasetAbNorm, batch_size=1, shuffle=True, num_workers=workers)

    for i, data1 in enumerate(dataloaderAbnormal, 0):

Interactive interpreters (such as the one used in Spyder) don't work well with multiprocessing, so you could run with a single worker (num_workers=0) while (live) developing in Spyder and add multiprocessing later (and execute the script from the terminal or as a standalone script in the IDE).
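For example, while developing interactively you could construct the loader as:

# single-process loading avoids the fork warning in interactive sessions
dataloaderAbnormal = torch.utils.data.DataLoader(
    datasetAbNorm, batch_size=1, shuffle=True, num_workers=0)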


I think the problem is in this part. I want to load only the images whose names are in a predefined list (not all of them), and with this code sometimes nothing is returned.

class CustomDataSetZoneAbnormal_HMScan():
    def __init__(self, main_dir, dirusedList, transforms, zone):
        self.main_dir = main_dir
        self.zone = zone
        self.transforms = transforms
        self.dirusedList = np.load(dirusedList)
        all_imgs = os.listdir(main_dir)
        self.total_imgs = sorted(all_imgs)

    def __getitem__(self, idx):
        # take the filename without its extension, then drop the last two '_'-separated parts
        cc = self.total_imgs[idx]
        ccSplit = cc.split('.')
        ccSplitsec = ccSplit[-2].split('_')
        ccSplitTh = ccSplitsec[:-2]
        for ii in range(len(self.dirusedList)):
            if ccSplitTh == self.dirusedList[ii]:
                img_loc = os.path.join(self.main_dir, self.total_imgs[idx])
                image11 = mpimg.imread(img_loc).astype(float)
                return image11
            else:
                continue
        # note: if no entry matches, this falls through and implicitly returns None

    def __len__(self):
        return len(self.total_imgs)

Hello everyone, I have an Image.mat file with 60556 image samples. It has 2 keys, "image_data" and "image_labels". When I execute the following command, the result is as follows:

data = io.loadmat('/dataset/Train48d48.mat')
image_data = data['image_data']
image_labels = np.squeeze(data['image_labels'])
print(image_data.shape, image_labels.shape)

(60556, 48, 48, 3) (60556,)
I want to build a DataLoader. My code is:

class MYDATASET(Dataset):
    def __init__(self, mat_paths):
        self.mat_paths = '/dataset/Train48d48.mat'

    def __getitem__(self, index):
        data = io.loadmat(self.mat_paths[index])
        x = data['image_data']
        y = np.squeeze(data['image_labels'])
        x = torch.from_numpy(x)
        y = torch.from_numpy(y)
        return x, y

    def __len__(self):
        return len(self.mat_paths)

trainset = MYDATASET(mat_paths='/dataset')
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)

I was expecting, after executing this command:

one_train_batch_imgs, one_train_batch_lbls = next(iter(trainloader))
print(one_train_batch_imgs.shape)

to see the result below:
torch.Size([128, 3, 48, 48])
But instead I get this error: "FileNotFoundError: [Errno 2] No such file or directory: 'd.mat'".
Where is my mistake? Any suggestions, ideas, or help would be appreciated.

This path shouldn't work in the first place. You are trying to load from a directory directly under the filesystem root (add a . in front to point to the current directory).

Another thing: you are loading the whole dataset every time a single item is requested by the DataLoader. Consider moving this part to __init__; otherwise it will bottleneck your loop.


Thanks for your attention, but unfortunately I still did not get a result.

Ah, I see.
First, I don't think you can load a specific element from a *.mat file by providing [index]. You are loading the whole dataset every time, which is why I recommended moving the loading part to __init__ so it happens only once.

Second, self.mat_paths[index] in io.loadmat(self.mat_paths[index]) is processed as a string and returns the letter at position index. Since your path is '/dataset/Train48d48.mat', the DataLoader requested the item with index 1 or 16, which is the letter d, and the loadmat function then appended .mat as the file extension.
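You can see this directly:

path = '/dataset/Train48d48.mat'
print(path[1], path[16])  # both print 'd'
# io.loadmat('d') then appends the .mat extension, hence FileNotFoundError: 'd.mat'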

It should be something like this:

class MYDATASET(Dataset):
    def __init__(self, mat_paths):
        # load the .mat file once instead of on every __getitem__ call
        self.data = io.loadmat(mat_paths)

    def __getitem__(self, index):
        x = self.data['image_data'][index]
        y = np.squeeze(self.data['image_labels'])[index]
        x = torch.from_numpy(x)
        y = torch.tensor(y)  # indexing yields a numpy scalar, which from_numpy does not accept
        return x, y

    def __len__(self):
        return len(self.data['image_data'])

Thank you, Sergey. You helped me a lot. I was able to achieve the result with a few changes. I will leave the code here for those who may see it later.

class MYDATASET(data.Dataset):
    def __init__(self, mat_paths, transform=None):
        self.mat_paths = mat_paths
        self.data = io.loadmat(mat_paths)
        self.transform = transform

    def __getitem__(self, index):
        x = self.data['image_data'][index]
        y = np.squeeze(self.data['image_labels'])[index]
        if self.transform:
            x = self.transform(x)
        return x, y

    def __len__(self):
        return len(self.data['image_data'])

transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.ToTensor()
])

trainset = MYDATASET(mat_paths='/dataset/Train48d48.mat', transform=transform)

trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True)
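With this in place, the sanity check from earlier produces the expected batch shape:

one_train_batch_imgs, one_train_batch_lbls = next(iter(trainloader))
print(one_train_batch_imgs.shape)  # torch.Size([128, 3, 48, 48])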