Weird Behavior of SubsetRandomSampler

Filos92 · January 10, 2020, 1:37pm

Good afternoon everyone,
my neural network shows a behavior I can’t understand. I have already traced the source of the problem to the SubsetRandomSampler. Maybe you can help me understand it.

When I create my DataLoaders like this:

Dataset_val = Dataset(transform=Augmentation.transformation['val'])
Dataset_train = Dataset(transform=Augmentation.transformation['train'])

dataset_size = len(self.Dataset_train)
indices = list(range(dataset_size))
split = int(np.floor(test_size * dataset_size))
shuffle_dataset = True
if shuffle_dataset:
    np.random.seed(random_seed)
    np.random.shuffle(indices)

train_indices, val_indices = indices[split:], indices[:split]
train_sampler = SequentialSampler(train_indices)
valid_sampler = SequentialSampler(val_indices)

self.train_loader = torch.utils.data.DataLoader(Dataset_train,
                                                batch_size=64,
                                                sampler=train_sampler)

self.validation_loader = torch.utils.data.DataLoader(self.Dataset_val,
                                                     batch_size=64,
                                                     sampler=valid_sampler)

Everything works fine!
[Image: Figure_2]
But if I do:

Dataset_val = Dataset(transform=Augmentation.transformation['val'])
Dataset_train = Dataset(transform=Augmentation.transformation['train'])

dataset_size = len(self.Dataset_train)
indices = list(range(dataset_size))
split = int(np.floor(test_size * dataset_size))

train_indices, val_indices = indices[split:], indices[:split]
train_sampler = SubsetRandomSampler(train_indices)
valid_sampler = SubsetRandomSampler(val_indices)

self.train_loader = torch.utils.data.DataLoader(Dataset_train,
                                                batch_size=64,
                                                sampler=train_sampler)

self.validation_loader = torch.utils.data.DataLoader(self.Dataset_val,
                                                     batch_size=64,
                                                     sampler=valid_sampler)

[Image: Figure_1]
The training loss behaves exactly as before, but the validation loss stops decreasing almost immediately.

I’ve tried to find any differences between the data by looking at small datasets of 5 to 10 data points passed to the training or validation process, but as far as I can tell the same data, with the same type and size, is passed.
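
One way to make this comparison concrete is to list the indices each sampler actually yields. The following is only a minimal sketch: the ten index values and the split of 3 are made up and stand in for the shuffled `indices` above, they are not the real data. As far as I understand the PyTorch samplers, SequentialSampler only uses the length of whatever it is given and yields the positions 0..len-1, while SubsetRandomSampler yields the values stored in the list, in random order:

```python
from torch.utils.data import SequentialSampler, SubsetRandomSampler

# Hypothetical stand-in for the shuffled index list (not the real data)
indices = [7, 3, 9, 1, 5, 0, 8, 2, 6, 4]
split = 3
train_indices, val_indices = indices[split:], indices[:split]

# SequentialSampler iterates over range(len(data_source)),
# so the values stored in the lists are never used
seq_train = list(SequentialSampler(train_indices))
seq_val = list(SequentialSampler(val_indices))

# SubsetRandomSampler yields the stored values in random order;
# sorting just makes the comparison easier to read
sub_train = sorted(SubsetRandomSampler(train_indices))
sub_val = sorted(SubsetRandomSampler(val_indices))

# Dataset indices that both loaders would draw from
overlap_seq = set(seq_train) & set(seq_val)
overlap_sub = set(sub_train) & set(sub_val)

print("SequentialSampler   train:", seq_train, "val:", seq_val)
print("SubsetRandomSampler train:", sub_train, "val:", sub_val)
print("overlap (sequential):", sorted(overlap_seq))
print("overlap (subset random):", sorted(overlap_sub))
```

If this carries over to the setup above, the validation loader in the first snippet would draw samples that the training loader also sees, while the two loaders in the second snippet would be disjoint.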

Is there something wrong with my code, or am I misunderstanding something completely?
Best Greetings,
Filos92

ptrblck · January 10, 2020, 7:54pm

It seems the difference between the two code snippets is the manual shuffling of the indices?

How did you define your Dataset? Could it be that the classes are sorted internally?
If so, the second approach would split the sorted data without shuffling: the valid_sampler would only see the first classes (indices[:split]) and the train_sampler only the remaining ones, which could explain your observation.
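
To illustrate the effect of an unshuffled split on ordered data, here is a minimal sketch with synthetic values. The sorted `target` array, the 20% split, and the seed are assumptions made up for the illustration, not the data from this thread. It compares how far apart the mean target of the training and validation parts ends up with and without shuffling the indices first:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 1000 target values stored in sorted order
target = np.sort(rng.normal(size=1000))
indices = np.arange(len(target))
split = int(np.floor(0.2 * len(target)))

# Unshuffled split: validation gets the first rows, training the rest
val_plain, train_plain = indices[:split], indices[split:]

# Shuffled split: same sizes, but the rows are mixed first
shuffled = rng.permutation(indices)
val_shuf, train_shuf = shuffled[:split], shuffled[split:]

# Distance between the mean target of the two parts
gap_plain = abs(target[train_plain].mean() - target[val_plain].mean())
gap_shuf = abs(target[train_shuf].mean() - target[val_shuf].mean())

print(f"mean gap without shuffling: {gap_plain:.3f}")
print(f"mean gap with shuffling:    {gap_shuf:.3f}")
```

A large gap without shuffling would suggest that the validation part covers a different range of values than the training part. The same comparison could be run on the real target column to see whether the rows are ordered in some way.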

Filos92 · January 13, 2020, 9:31am

Yes, the difference is the shuffling.
My Dataset looks like this:

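import pandas as pd
from sklearn.preprocessing import StandardScaler


class Dataset():

    def __init__(self, Path):
        # Read the semicolon-separated CSV; the first row holds the column names
        input_values = pd.read_csv(Path, header=0, sep=';')
        data = input_values[['position_x', 'position_y', 'position_z']].values.tolist()
        target = input_values['req_pos'].values.tolist()

        # Standardize the three input features (fitted on the full dataset)
        scaler = StandardScaler()

        self.data = scaler.fit_transform(data)
        self.target = target

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # Return one (features, target) pair
        data = self.data[index]
        target = self.target[index]

        values = data, target

        return values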

I don’t use classes, because this is a regression problem. In general I would say all entries are equivalent, so I don’t think I am passing the wrong classes.

Filos92 · January 22, 2020, 7:41am

I do not use the SubsetRandomSampler any more, but my question is still open.