Saving split dataset

We can split a dataset with torch.utils.data.random_split. However, to reproduce results later, is it possible to save the split datasets and load them again?

You could use a seed for the random number generator (torch.manual_seed) and make sure the split is the same every time.
Alternatively, you could split the sample indices, store each index tensor locally via torch.save, and use it in Subset.
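
A minimal sketch of the second approach, assuming trainval_dataset, train_size, and val_size are already defined in your script, could look like this:

import torch
from torch.utils.data import Subset

# Create a reproducible permutation of the sample indices.
generator = torch.Generator().manual_seed(0)
perm = torch.randperm(len(trainval_dataset), generator=generator)
train_indices = perm[:train_size]
val_indices = perm[train_size:train_size + val_size]

# Store the index tensors so the exact split can be restored later.
torch.save(train_indices, 'train_indices.pt')
torch.save(val_indices, 'val_indices.pt')

# In a later run, rebuild the identical subsets from the saved indices.
# (.tolist() keeps indexing compatible with datasets expecting Python ints)
train_dataset = Subset(trainval_dataset, torch.load('train_indices.pt').tolist())
val_dataset = Subset(trainval_dataset, torch.load('val_indices.pt').tolist())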


Thank you ptrblck for the great answer, as always. It does work, but I stumbled upon a strange issue.

I split my training data into a training and a validation set using a deterministic seed, as suggested:

torch.manual_seed(0)
train_dataset, val_dataset = torch.utils.data.random_split(trainval_dataset, [train_size, val_size])

I then wanted to evaluate the CNN on the validation set (using torchvision CIFAR10). When I evaluate it on the test set, the accuracy is always the same, as expected. However, when I evaluate it on the validation set, the accuracy changes between runs. Delving into the code, I realized the problem is not with the validation set itself (its targets are identical every time I run the script) but that the network produces different outputs.

How is that possible? Especially since it only appears when feeding the validation set, not the test set?

Are you calling model.eval() when checking the accuracy on the validation set?
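
For reference, a typical evaluation pass looks roughly like this sketch (model and val_loader are assumed to be defined in your script):

import torch

model.eval()                   # use running BatchNorm stats, disable dropout
correct, total = 0, 0
with torch.no_grad():          # no gradients needed during evaluation
    for inputs, targets in val_loader:
        inputs, targets = inputs.to('cuda'), targets.to('cuda')
        outputs = model(inputs)
        predicted = outputs.argmax(dim=1)
        correct += predicted.eq(targets).sum().item()
        total += targets.size(0)
print('validation accuracy:', correct / total)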

Yes, I am. I also just tried model.train() for the sake of completeness and the same erratic behavior happens.

To understand the issue completely: you are calling model.eval() and checking the accuracy on the validation set. The validation set indices (passed to Subset) are definitely the same, but the accuracy changes between consecutive runs?

Precisely. Here are two different runs as examples. I show the output for the same batch, printing, respectively:

   print(predicted.eq(targets).sum().item())
   print(predicted.eq(targets))
   print(predicted)
   print(targets)

The targets are the same, but the predicted values vary slightly.

Run 1:


118
tensor([1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1], device='cuda:0', dtype=torch.uint8)
tensor([4, 8, 1, 0, 2, 4, 3, 4, 5, 6, 4, 4, 5, 7, 3, 2, 4, 5, 1, 7, 9, 9, 7, 9,
2, 4, 1, 1, 4, 8, 2, 9, 7, 6, 9, 1, 2, 9, 1, 1, 5, 1, 7, 7, 9, 4, 3, 3,
4, 6, 0, 5, 5, 5, 7, 7, 0, 7, 0, 4, 7, 3, 6, 1, 4, 0, 4, 0, 3, 1, 4, 8,
7, 6, 3, 7, 0, 5, 2, 0, 8, 5, 0, 2, 9, 7, 2, 2, 2, 8, 9, 6, 1, 1, 9, 1,
4, 9, 4, 8, 7, 6, 4, 7, 7, 8, 0, 6, 7, 4, 7, 5, 8, 3, 1, 3, 9, 8, 5, 8,
4, 2, 3, 7, 7, 2, 5, 1], device='cuda:0')
tensor([4, 8, 1, 2, 2, 4, 3, 4, 5, 6, 4, 4, 5, 7, 5, 2, 4, 5, 1, 7, 1, 9, 7, 9,
2, 4, 1, 1, 4, 8, 2, 9, 7, 6, 9, 1, 2, 9, 1, 1, 5, 1, 7, 7, 9, 4, 3, 3,
4, 6, 0, 5, 5, 5, 7, 7, 0, 7, 0, 4, 7, 3, 6, 1, 5, 0, 4, 0, 3, 1, 4, 8,
7, 6, 3, 7, 2, 5, 3, 0, 8, 5, 0, 2, 9, 7, 2, 3, 2, 8, 9, 6, 1, 1, 9, 1,
4, 9, 4, 8, 7, 5, 4, 7, 7, 8, 0, 6, 7, 4, 7, 5, 8, 6, 1, 3, 1, 8, 5, 8,
4, 2, 3, 7, 7, 2, 5, 1], device='cuda:0')

Run 2:


121
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1], device='cuda:0', dtype=torch.uint8)
tensor([4, 8, 1, 2, 2, 4, 3, 4, 5, 6, 4, 4, 5, 7, 3, 2, 4, 5, 1, 7, 1, 9, 7, 9,
2, 4, 1, 1, 4, 8, 2, 9, 7, 6, 9, 1, 2, 9, 1, 1, 5, 1, 7, 7, 9, 4, 3, 3,
4, 6, 0, 5, 5, 3, 7, 7, 0, 7, 0, 4, 7, 3, 6, 1, 5, 0, 4, 0, 3, 1, 4, 8,
7, 6, 3, 7, 2, 5, 3, 0, 0, 5, 0, 2, 9, 7, 2, 2, 2, 8, 9, 6, 1, 1, 9, 1,
4, 9, 4, 8, 4, 5, 4, 7, 7, 8, 0, 6, 7, 4, 7, 5, 8, 3, 1, 4, 1, 8, 5, 8,
4, 2, 3, 7, 7, 2, 5, 1], device='cuda:0')
tensor([4, 8, 1, 2, 2, 4, 3, 4, 5, 6, 4, 4, 5, 7, 5, 2, 4, 5, 1, 7, 1, 9, 7, 9,
2, 4, 1, 1, 4, 8, 2, 9, 7, 6, 9, 1, 2, 9, 1, 1, 5, 1, 7, 7, 9, 4, 3, 3,
4, 6, 0, 5, 5, 5, 7, 7, 0, 7, 0, 4, 7, 3, 6, 1, 5, 0, 4, 0, 3, 1, 4, 8,
7, 6, 3, 7, 2, 5, 3, 0, 8, 5, 0, 2, 9, 7, 2, 3, 2, 8, 9, 6, 1, 1, 9, 1,
4, 9, 4, 8, 7, 5, 4, 7, 7, 8, 0, 6, 7, 4, 7, 5, 8, 6, 1, 3, 1, 8, 5, 8,
4, 2, 3, 7, 7, 2, 5, 1], device='cuda:0')

Were you able to exactly reproduce the same model parameters for your runs?
I.e. did you compare the state_dicts before running the validation loop?
Even if you are seeding and getting the same data samples for your runs, the result might still differ, e.g. due to cudnn as described in the Reproducibility docs.
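
For reference, a minimal sketch of the settings mentioned there (the exact recommendations depend on your PyTorch version, and deterministic mode can be slower):

import torch

# Seed the CPU and all GPU RNGs, and force deterministic cudnn behavior.
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
torch.backends.cudnn.deterministic = True   # pick deterministic cudnn kernels
torch.backends.cudnn.benchmark = False      # disable non-deterministic autotuning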

I load the same pre-trained model in both cases. When I run this model on the test set, the output is always the same. Do you think there could still be room for such nondeterministic behavior?

Did you follow the advice from the reproducibility docs?
Are you using any random transformations in your Dataset for the validation set?
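
For example, the validation pipeline would normally use only deterministic transforms, roughly like this sketch (names are illustrative, not taken from your script):

import torchvision.transforms as transforms

# Random augmentations belong in the training transform only; the
# validation transform should be fully deterministic.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
val_transform = transforms.Compose([
    transforms.ToTensor(),
])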