Custom dataset for hyperspectral data (.mat file)

Hi! I am trying to make a custom dataset for HSI data (H×W×D) and its ground truth (H×W), both stored in .mat files, which I would like to use in:
train_loader = torch.utils.data.DataLoader()
I found a class skeleton like this:

from torch.utils.data.dataset import Dataset

class MyCustomDataset(Dataset):
    def __init__(self, ...):
        # stuff

    def __getitem__(self, index):
        # stuff
        return (img, label)

    def __len__(self):
        return count
I want to use it to build a 3D CNN model for classification. Could you give me some help or an example for this? Or is a custom dataset class even necessary for a CNN?
Thank you in advance!

You could pass the paths to your Dataset's __init__ function and lazily load each sample in __getitem__.
Since your data is stored in .mat files, you could try to use scipy.io.loadmat to load it.
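For reference, here is a minimal sketch of that round trip with scipy.io; the file name and variable key below are invented for the demo (your file would use the real Pavia keys):

```python
import numpy as np
import scipy.io as io

# Create a tiny stand-in .mat file (in practice you would already have Pavia.mat)
io.savemat('demo.mat', {'pavia': np.zeros((4, 4, 3), dtype=np.float32)})

# loadmat returns a dict mapping variable names to arrays
mat = io.loadmat('demo.mat')
cube = mat['pavia']   # hyperspectral cube of shape (H, W, D)
print(cube.shape)     # (4, 4, 3)
```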

Thanks for the reply. I came up with something like this:

 
import numpy as np
import scipy.io as io
import torch
from torch.utils.data.dataset import Dataset

class MyDataset(Dataset):
    def __init__(self, mat_path, gt_path):
        data = io.loadmat(mat_path)['pavia']
        data = np.uint8(data)
        self.images = torch.from_numpy(data)
        data_gt = io.loadmat(gt_path)
        self.target = torch.from_numpy(data_gt['pavia_gt'])

    def __getitem__(self, index):
        x = self.images[index]
        y = self.target[index]
        return x, y

    def __len__(self):
        return len(self.images)

mat_path = './data/Pavia.mat'
gt_path = './data/Pavia_gt.mat'
custom_dataset = MyDataset(mat_path, gt_path)
train_loader = torch.utils.data.DataLoader(dataset=custom_dataset,
                                           batch_size=64, 
                                           shuffle=True)

Now I have no idea how to use this with torch.utils.data.
Or should I make a train/test split first?

You could perform the splitting beforehand.
Based on your code, it seems that using a torch.utils.data.Subset would be easier, as you would only have to provide the corresponding indices.
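A minimal sketch of that idea, using a toy Dataset in place of your MyDataset (the 80/20 split ratio is just an example):

```python
import torch
from torch.utils.data import Dataset, Subset

class ToyDataset(Dataset):
    """Stand-in for MyDataset: ten scalar samples."""
    def __init__(self, n):
        self.x = torch.arange(n, dtype=torch.float32)

    def __getitem__(self, index):
        return self.x[index]

    def __len__(self):
        return len(self.x)

full = ToyDataset(10)

# Shuffle the indices once, then hand disjoint index lists to Subset
indices = torch.randperm(len(full)).tolist()
split = int(0.8 * len(full))
train_set = Subset(full, indices[:split])
test_set = Subset(full, indices[split:])
print(len(train_set), len(test_set))  # 8 2
```

Each Subset can then be wrapped in its own DataLoader.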

You can just iterate the DataLoader:

for data, target in train_loader:
    data = data.to(device)
    target = target.to(device)
    optimizer.zero_grad()
    output = model(data)
    ...

It will automatically create the batches, shuffle your Dataset, and use multiprocessing to load the data (if num_workers > 0).
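To see the batching in action without your .mat files, here is a self-contained sketch with random tensors standing in for the Pavia cube and labels (shapes and class count are made up):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy data: 100 samples with 3 features each, labels in [0, 9)
data = torch.randn(100, 3)
target = torch.randint(0, 9, (100,))

loader = DataLoader(TensorDataset(data, target),
                    batch_size=64, shuffle=True, num_workers=0)

for x, y in loader:
    # 100 samples at batch_size=64 -> one batch of 64, one of 36
    print(x.shape, y.shape)
```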