PyTorch custom data loader

Hello everyone, I have just started learning PyTorch and I ran into a problem while trying to create a DataLoader from my custom dataset, which consists of 20 files "data_x.mat" stored in a specific folder. I need to use them with a DataLoader. Can anyone help me write this class to get iterable batches?
My data is:

  • 10 files, each a dict {"Training_Patches": shape (760, 120, 21, 21) / "Label": shape (1, 760)}, stored as a .mat file
  • 10 files, each a dict {"Testing_Patches": shape (420, 120, 21, 21) / "Label": shape (1, 420)}, stored as a .mat file

Here is what I think it should look like. Any suggestions, ideas, or help would be appreciated:
class Datasets(Dataset):
    def __init__(self):
        self.tensors = []
        self.labels = []
        for i in range(len(data_i["Training_Patches"])):
            self.tensors.append(data_i["Training_Patches"])
            self.labels.append(data_i["Label"])

    def __getitem__(self, index):
        # return one item at the index
        return

    def __len__(self):
        # return the data length
        return

Would you like to load each file as one sample or is a whole batch saved in each file?
Were you able to load the data using scipy.io.loadmat?
Here is a dummy example for your dataset:


import torch
from scipy import io
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, mat_paths):
        self.paths = mat_paths

    def __getitem__(self, index):
        # Load one .mat file per sample
        data = io.loadmat(self.paths[index])
        x = torch.from_numpy(data['Training_Patches'])
        y = torch.from_numpy(data['Label'])
        return x, y

    def __len__(self):
        return len(self.paths)

Thanks ptrblck for replying. What I need is to load the whole set of files in the directory as one dataset to use with torch.utils.data.DataLoader, as follows:

Train_dst = MyDataset(path=path, train=True)
Test_dst = MyDataset(path=path, train=False)

TrainLoader = torch.utils.data.DataLoader(dataset=Train_dst, batch_size=50, shuffle=True)
TestLoader = torch.utils.data.DataLoader(dataset=Test_dst, batch_size=50, shuffle=True)

I’m not sure I understand your use case properly.
If each file should be a sample, you could only use a batch size of at most 10, as you only have 10 files.
Did my code sample work for you?
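If each patch should be a sample instead (so that batch_size=50 works), a minimal sketch would be to load and concatenate all files up front. The class name, the train flag, and the assumption that all 20 .mat files live in one folder are illustrative, based on the description above:

import os
import glob
import numpy as np
import torch
from scipy import io
from torch.utils.data import Dataset

class PatchDataset(Dataset):
    def __init__(self, path, train=True):
        key = 'Training_Patches' if train else 'Testing_Patches'
        xs, ys = [], []
        for mat_file in sorted(glob.glob(os.path.join(path, '*.mat'))):
            data = io.loadmat(mat_file)
            if key not in data:
                continue  # skip files belonging to the other split
            xs.append(data[key])                  # e.g. (760, 120, 21, 21) per file
            ys.append(data['Label'].reshape(-1))  # (1, 760) -> (760,)
        self.x = torch.from_numpy(np.concatenate(xs, axis=0))
        self.y = torch.from_numpy(np.concatenate(ys, axis=0))

    def __getitem__(self, index):
        return self.x[index], self.y[index]

    def __len__(self):
        return self.x.size(0)

With this, Train_dst = PatchDataset(path, train=True) would expose 7600 individual patches, so a DataLoader with batch_size=50 works as intended.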


I have an Image.mat file with 2656 image samples.
It has 4 keys; the "img" key contains the data with size (2656, 4097), where the last column is the class label.
How can I build a dataset based on this .mat file?

I assume the data fits completely into memory?
If so, you could load the .mat file using scipy in your Dataset's __init__ method, get a single sample in __getitem__ by indexing, and return the tensor via torch.from_numpy.
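For example, a minimal sketch under these assumptions (key 'img', last column holding the label, class name illustrative):

import torch
from scipy import io
from torch.utils.data import Dataset

class ImgMatDataset(Dataset):
    def __init__(self, mat_path):
        # load once; shape (2656, 4097), last column is the class label
        arr = io.loadmat(mat_path)['img']
        self.x = torch.from_numpy(arr[:, :-1]).float()
        self.y = torch.from_numpy(arr[:, -1]).long()

    def __getitem__(self, index):
        return self.x[index], self.y[index]

    def __len__(self):
        return self.x.size(0)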

Hi ptrblck,

I wrote my custom data loader based on your guidance. The code is correct and works. The problem is that changing the batch size has no effect: it always gives me batches of 64, regardless of the batch_size argument. Would you please help me with that?

workers = 2
Batchsize = 128

datarootNorm = '//homeHealthytorchpatch_v1/'

datasetNorm = DatasetLoad(datarootNorm, transforms=transform, debug=False, ii=1)

dataloaderNormal = torch.utils.data.DataLoader(datasetNorm, batch_size=Batchsize, shuffle=True, num_workers=workers)


class DatasetLoad():

    def __init__(self, root_dirTrain, transforms, debug, ii):
        self.patches, self.labels = None, None
        self.ii = ii
        patch_path = os.path.join(root_dirTrain, 'Healthyzone' + str(self.ii) + ".csv")
        # 15000 x 1 x 64 x 64
        self.patches = torch.load(patch_path)
        # self.patches = scipy.io.loadmat(patch_path)['ImagesNormal']
        # patch_path = os.path.join(root_dirTrain, "NegTraining11_v1" + ".mat")
        # self.patches = scipy.io.loadmat(patch_path)['NegTraining11_v1']
        self.transforms = transforms
        self.debug = debug

    def __getitem__(self, index):
        patchF = self.patches[index, :, :, :]
        if self.transforms is not None:
            patchF = self.transforms(patchF)
        return patchF

    def __len__(self):
        return self.patches.shape[-1]

What shape does self.patches have? Is it [15000, 1, 64, 64]?
If so, I assume dim0==15000 are the number of samples?
If that’s also correct, I think you are returning the wrong __len__, as self.patches.shape[-1] would return 64 instead of 15000, so your max. batch size would be 64 and you would only load patches[:64].
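That is, assuming dim0 holds the samples, the fix would be:

    def __len__(self):
        # dim0 holds the number of samples, not the last dimension
        return self.patches.shape[0]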


Many thanks, let me check.

Yes, you are right. I corrected it. I really appreciate it.

Hi ptrblck,

I am using the Spyder editor.
My code was working, but now when I use the DataLoader and try to get data from it, it gives me this error:

[SpyderKernelApp] WARNING | WARNING: attempted to send message from fork
{'header': {'version': '5.3', 'date': datetime.datetime(2021, 3, 29, 6, 7, 4, 493062, tzinfo=datetime.timezone.utc), 'session': '5b99e49d-1c858153ee7c8a8ade190ea5', 'username': 'mom008', 'msg_type': 'comm_msg', 'msg_id': '5b99e49d-1c858153ee7c8a8ade190ea5_84'}, 'msg_id': '5b99e49d-1c858153ee7c8a8ade190ea5_84', 'msg_type': 'comm_msg', 'parent_header':

The code is:

for ii in range(1, 7, 1):

    datarootAbNorm = '//home/mom008//work/RoiClassification/radius_32_32/selectedROIs/zone_' + str(ii) + '/'

    datasetAbNorm = CustomDataSetZoneAbnormal(datarootAbNorm, transforms=transform, zone=ii)

    dataloaderAbnormal = torch.utils.data.DataLoader(datasetAbNorm, batch_size=1, shuffle=True, num_workers=workers)

    for i, data1 in enumerate(dataloaderAbnormal, 0):

Interactive interpreters (such as the one used in Spyder) don't work well with multiprocessing, so you could run with a single worker (num_workers=0) while (live) developing in Spyder and add multiprocessing later (and execute the script from the terminal or as a standalone script in the IDE).
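For example, while developing interactively you could construct the loader as:

# single-process loading avoids the fork warning in interactive sessions
dataloaderAbnormal = torch.utils.data.DataLoader(
    datasetAbNorm, batch_size=1, shuffle=True, num_workers=0)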


I think the problem is in this part. I want to load only the images whose names are in a predefined list (not all of them), and with this code sometimes nothing is returned.

class CustomDataSetZoneAbnormal_HMScan():
    def __init__(self, main_dir, dirusedList, transforms, zone):
        self.main_dir = main_dir
        self.zone = zone
        self.transforms = transforms
        self.dirusedList = np.load(dirusedList)
        all_imgs = os.listdir(main_dir)
        self.total_imgs = sorted(all_imgs)

    def __getitem__(self, idx):
        # take the filename without its extension, then drop the last two '_'-separated parts
        cc = self.total_imgs[idx]
        ccSplit = cc.split('.')
        ccSplitsec = ccSplit[-2].split('_')
        ccSplitTh = ccSplitsec[:-2]
        for ii in range(len(self.dirusedList)):
            if ccSplitTh == self.dirusedList[ii]:
                img_loc = os.path.join(self.main_dir, self.total_imgs[idx])
                image11 = mpimg.imread(img_loc).astype(float)
                return image11
            else:
                continue
        # note: if no entry matches, this falls through and implicitly returns None

    def __len__(self):
        return len(self.total_imgs)

Hello everyone, I have an Image.mat file with 60556 image samples. It has 2 keys, "image_data" and "image_labels". When I execute the following command, the result is as follows:

data = io.loadmat('/dataset/Train48d48.mat')
image_data = data['image_data']
image_labels = np.squeeze(data['image_labels'])
print(image_data.shape, image_labels.shape)

(60556, 48, 48, 3) (60556,)
I want to build a DataLoader. My code is:

class MYDATASET(Dataset):
    def __init__(self, mat_paths):
        self.mat_paths = '/dataset/Train48d48.mat'

    def __getitem__(self, index):
        data = io.loadmat(self.mat_paths[index])
        x = data['image_data']
        y = np.squeeze(data['image_labels'])
        x = torch.from_numpy(x)
        y = torch.from_numpy(y)
        return x, y

    def __len__(self):
        return len(self.mat_paths)

trainset = MYDATASET(mat_paths='/dataset')
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)

I was expecting, after executing this command:

one_train_batch_imgs, one_train_batch_lbls = next(iter(trainloader))
print(one_train_batch_imgs.shape)

to see the result below:
torch.Size([128, 3, 48, 48])
But instead I get this error: "FileNotFoundError: [Errno 2] No such file or directory: 'd.mat'".
Where is my mistake? Any suggestions, ideas, or help would be appreciated.

This path shouldn't work in the first place. You are trying to load from a directory directly under the filesystem root (add a . in front to point to the current directory).

Another thing: you are loading the whole dataset every time a single item is requested by the DataLoader. Consider moving this part to __init__; otherwise it will bottleneck your loop.


Thanks for your attention, but unfortunately I still did not get a result.

Ah, I see.
First, I don't think you can load a specific element from a *.mat file by providing [index]. You are loading the whole dataset every time, which is why I recommended moving the loading part to __init__ so it happens only once.

Second, self.mat_paths[index] in io.loadmat(self.mat_paths[index]) is processed as a string and returns the letter at position index. Since your path is '/dataset/Train48d48.mat', the DataLoader requested the item with index 1 or 16, which is the letter d, and the loadmat function then appended .mat as the file extension.
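You can see this directly:

path = '/dataset/Train48d48.mat'
print(path[1], path[16])  # both print 'd'
# io.loadmat('d') then appends the .mat extension, hence FileNotFoundError: 'd.mat'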

It should be something like this:

class MYDATASET(Dataset):
    def __init__(self, mat_paths):
        # load the .mat file once instead of on every __getitem__ call
        self.data = io.loadmat(mat_paths)

    def __getitem__(self, index):
        x = self.data['image_data'][index]
        y = np.squeeze(self.data['image_labels'])[index]
        x = torch.from_numpy(x)
        y = torch.tensor(y)  # indexing yields a numpy scalar, which from_numpy does not accept
        return x, y

    def __len__(self):
        return len(self.data['image_data'])

Thank you, Sergey. You helped me a lot. I was able to achieve the result with a few changes. I will leave the code here for those who may see it later.

class MYDATASET(data.Dataset):
    def __init__(self, mat_paths, transform=None):
        self.mat_paths = mat_paths
        self.data = io.loadmat(mat_paths)
        self.transform = transform

    def __getitem__(self, index):
        x = self.data['image_data'][index]
        y = np.squeeze(self.data['image_labels'])[index]
        if self.transform:
            x = self.transform(x)
        return x, y

    def __len__(self):
        return len(self.data['image_data'])

transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.ToTensor()
])

trainset = MYDATASET(mat_paths='/dataset/Train48d48.mat', transform=transform)

trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True)
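With this in place, the sanity check from earlier produces the expected batch shape:

one_train_batch_imgs, one_train_batch_lbls = next(iter(trainloader))
print(one_train_batch_imgs.shape)  # torch.Size([128, 3, 48, 48])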