How to randomise data before splitting the dataset into train/valid/test

In my dataset, the image filenames are ordered by acquisition condition: for example, the first three images belong to the same patient (different heartbeats), the next three images belong to another patient, and so on …

I’m not sure how to randomise the cases across train:validation:test so that my test set is truly independent.
Here is my current splitting snippet, which does not randomise the data.

import glob
import torch

folder_data = glob.glob("D:\\Neda\\Pytorch\\U-net\\my_data\\imagesResized\\*.png")
folder_mask = glob.glob("D:\\Neda\\Pytorch\\U-net\\my_data\\labelsResized\\*.png")  # labels_2nd_Resized_Binary

# split these path using a certain percentage
len_data = len(folder_data)
print("count of dataset: ", len_data)
# count of dataset:  992


split_1 = int(0.6 * len(folder_data))
split_2 = int(0.8 * len(folder_data))

#folder_data.sort()

train_image_paths = folder_data[:split_1]
print("count of train images is: ", len(train_image_paths)) 

valid_image_paths = folder_data[split_1:split_2]
print("count of validation image is: ", len(valid_image_paths))

test_image_paths = folder_data[split_2:]
print("count of test images is: ", len(test_image_paths)) 


train_mask_paths = folder_mask[:split_1]
valid_mask_paths = folder_mask[split_1:split_2]
test_mask_paths = folder_mask[split_2:]

train_dataset = CustomDataset(train_image_paths, train_mask_paths)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=1, shuffle=True, num_workers=2)

valid_dataset = CustomDataset(valid_image_paths, valid_mask_paths)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=1, shuffle=True, num_workers=2)

test_dataset = CustomDataset(test_image_paths, test_mask_paths)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1, shuffle=False, num_workers=2)  

dataLoaders = {
        'train': train_loader,
        'valid': valid_loader,
        'test': test_loader,
        }
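One way to randomise the split while keeping each image aligned with its mask is to shuffle (image, mask) pairs before slicing. A runnable sketch with toy stand-in filenames (in the real script, `folder_data` and `folder_mask` come from the `glob.glob` calls above, sorted identically so index i in both lists refers to the same case):

```python
import random

# Toy stand-ins for the glob results
folder_data = [f"img_{i:03d}.png" for i in range(10)]
folder_mask = [f"lbl_{i:03d}.png" for i in range(10)]

random.seed(42)  # fix the seed so the split is reproducible across runs
pairs = list(zip(sorted(folder_data), sorted(folder_mask)))
random.shuffle(pairs)

split_1 = int(0.6 * len(pairs))
split_2 = int(0.8 * len(pairs))

train_pairs = pairs[:split_1]
valid_pairs = pairs[split_1:split_2]
test_pairs = pairs[split_2:]

# Unzip back into separate path lists for CustomDataset
train_image_paths, train_mask_paths = map(list, zip(*train_pairs))
valid_image_paths, valid_mask_paths = map(list, zip(*valid_pairs))
test_image_paths, test_mask_paths = map(list, zip(*test_pairs))
```

Because the shuffle happens on the pairs, every split keeps the image–mask correspondence intact.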

If I add random.shuffle(), the files will be shuffled, but how can I do the same for the mask folder so that each training image still has its corresponding mask?

import os, random
I = os.listdir('D:/Neda/Pytorch/U-net/my_data/imagesResized')
random.shuffle(I)
print(I)

Hi,
You can zip the file and mask lists so that the same shuffling is applied to both:

I = sorted(os.listdir('D:/Neda/Pytorch/U-net/my_data/imagesResized'))  # sort so indices in both lists correspond
M = sorted(os.listdir(path_to_masks))
pair_MI = list(zip(I, M))  # list() is needed in Python 3: zip() returns an iterator, which random.shuffle() can't handle
random.shuffle(pair_MI)
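After shuffling, the pairs can be unzipped back into two aligned lists. A runnable sketch with stand-in filenames (note that in Python 3, zip() returns an iterator, so it must be converted to a list before shuffling):

```python
import random

I = [f"img_{i}.png" for i in range(6)]  # stand-in for os.listdir(images_dir)
M = [f"msk_{i}.png" for i in range(6)]  # stand-in for os.listdir(masks_dir)

pair_MI = list(zip(I, M))  # list() is required: random.shuffle needs a mutable sequence
random.shuffle(pair_MI)

# Unzip back into two lists whose indices still correspond
I_shuffled, M_shuffled = map(list, zip(*pair_MI))
```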

Regarding splitting the dataset: I see you are using U-Net, so I assume you are segmenting something in medical images.
Do you have several classes, or a single one?

If you have several classes, you should try to ensure that your validation and test sets contain the same proportion of samples as the training set (even if this condition is not always satisfied). You should also ensure that images from one patient don’t appear in both the training and the val/test sets, as they may be similar and wouldn’t make a proper validation set.

In short, create a val/test set with patients never seen in the training set. As far as I know, in medical imaging, results may also overfit to the machines the data was acquired with; if your data comes from several machines, try to shuffle across them as well.
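Since the question states that consecutive triplets of files belong to the same patient, a patient-level split can be sketched by grouping files per patient, shuffling the groups, and then slicing. This is an illustrative sketch, assuming exactly three images per patient and sorted, aligned image/mask lists (the per-patient count is an assumption, not something stated in the code above):

```python
import random

# Toy stand-ins: 12 images/masks, 3 consecutive files per patient
images = [f"img_{i:03d}.png" for i in range(12)]
masks = [f"msk_{i:03d}.png" for i in range(12)]

per_patient = 3  # assumption: every patient contributes exactly 3 images
patients = [list(zip(images[i:i + per_patient], masks[i:i + per_patient]))
            for i in range(0, len(images), per_patient)]

random.seed(0)
random.shuffle(patients)  # shuffle whole patients, not individual images

n_train = int(0.6 * len(patients))
n_valid = int(0.8 * len(patients))

# Flatten the patient groups back into (image, mask) pairs
train = [pair for p in patients[:n_train] for pair in p]
valid = [pair for p in patients[n_train:n_valid] for pair in p]
test = [pair for p in patients[n_valid:] for pair in p]
```

Because shuffling happens at the patient level, no patient can end up with some heartbeats in train and others in test.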

Regards
Juan


@JuanFMontesinos Thank you for the comment, good points. Yes, I work on echocardiography images, and the task has two classes. At the moment U-Net gives me reasonable accuracy; I also tried three other models, but U-Net is the winner.