How to split data into train and test sets while keeping equal proportions of each class?

Suppose I have a dataset with the following classes:

Class A: 3000 items
Class B: 1000 items
Class C: 2000 items

I want to split this dataset into two parts so that 25% of the data goes into the test set. How can I do this so that an equal percentage of each class is present in the test set? The items should be randomly selected. For example, the test set should look like the following:

Class A: 750 items
Class B: 250 items
Class C: 500 items

Make a list for each class, take 25% at random from each list, combine the lists and shuffle.
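
A minimal sketch of that recipe in NumPy (the `labels` array, the 25% ratio, and the function name are just for illustration):

    import numpy as np

    def stratified_indices(labels, test_frac=0.25, seed=0):
        # collect the indices of each class, take a fixed fraction from
        # each at random, then shuffle the combined train/test index lists
        rng = np.random.default_rng(seed)
        train_idx, test_idx = [], []
        for cls in np.unique(labels):
            idx = rng.permutation(np.where(labels == cls)[0])
            n_test = int(len(idx) * test_frac)
            test_idx.extend(idx[:n_test])
            train_idx.extend(idx[n_test:])
        rng.shuffle(train_idx)
        rng.shuffle(test_idx)
        return train_idx, test_idx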

Is there any PyTorch code for this, or do we need to use some other library?

I am asking for PyTorch code for performance reasons, in case of large datasets.

You could keep lists of pairs of filenames and labels and prepare batches asynchronously in a background worker thread. That should be efficient even when training with millions of images. You might want to take a look at the torch.utils.data.dataset.Dataset and torch.utils.data.sampler.Sampler classes, which you can use in conjunction with the torch.utils.data.DataLoader.
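
A minimal sketch of that lazy-loading pattern (the `LazyImageDataset` name, the `samples` list of pairs, and loading via PIL are assumptions for illustration):

    import torch
    from torch.utils.data import Dataset, DataLoader
    from PIL import Image

    class LazyImageDataset(Dataset):
        """Stores only (filename, label) pairs; images are read on demand."""
        def __init__(self, samples, transform=None):
            self.samples = samples          # list of (path, label) tuples
            self.transform = transform

        def __len__(self):
            return len(self.samples)

        def __getitem__(self, index):
            path, label = self.samples[index]
            image = Image.open(path).convert('RGB')  # disk I/O happens here
            if self.transform is not None:
                image = self.transform(image)
            return image, label

    # The DataLoader's worker processes then do the image loading in the background:
    # loader = DataLoader(LazyImageDataset(samples), batch_size=32, num_workers=4)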

If you load the dataset completely before passing it to the Dataset and DataLoader classes, you could use scikit-learn’s train_test_split with the stratify option.

In that case, will it be possible to use something like num_workers while loading?

This would split the dataset before using any of the PyTorch classes.
You would get different splits and create different Dataset classes:


import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 2)
y = np.random.randint(0, 10, size=1000)

# stratify=y keeps the class proportions equal in both splits
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, stratify=y)

# check the per-class counts in each split
np.unique(y_train, return_counts=True)
np.unique(y_val, return_counts=True)

# wrap the splits in your own Dataset and DataLoader as usual
train_dataset = Dataset(X_train, y_train, ...)
train_loader = DataLoader(train_dataset, ...)

The DataLoader is thus completely unaffected by this and you can use num_workers as you wish.
If you need lazy loading, @Pfaeff had a good approach. :wink:

Understood, thanks a lot.

I also have the same question. How can I solve this issue?

Is the mentioned train_test_split not working for your use case?

I have image data. I want to split it in such a way that the class proportions are maintained.

Are you able to get all the targets without loading the actual data?
If so, you could use train_test_split passing the indices and targets, and use the index splits for your Subsets.
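
A minimal sketch of that index-based approach (`dataset` and `targets` are placeholders for your own objects):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from torch.utils.data import Subset

    # targets holds one label per sample and can be built without loading images
    indices = np.arange(len(targets))
    train_idx, val_idx = train_test_split(indices, test_size=0.25, stratify=targets)

    # Subset just remembers the indices; the underlying dataset stays lazy
    train_dataset = Subset(dataset, train_idx)
    val_dataset = Subset(dataset, val_idx)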

The problem is solved. Thanks!

Dear Ptrblck,

I created the split data for validation and training, and now I want to pass it to the data loader. I am confused about how to prepare my data for the data loader; in particular, I don’t know how to create DatasetTrain and DatasetValid. I tried this:

    [train_D, valid_D, train_L, valid_L] = train_test_split(WholeData.numpy(), WholeTargetArray, test_size=0.2, train_size=0.8, shuffle=True, stratify=WholeTargetArray)
    DatasetTrain = Dataset(train_D, train_L)  # ???????
    DatasetValid = Dataset(valid_D, valid_L)  # ?????
    trainloader = torch.utils.data.DataLoader(DatasetTrain, batch_size=32, shuffle=True, drop_last=True, num_workers=0)
    validationloader = torch.utils.data.DataLoader(DatasetValid, batch_size=6, drop_last=True, num_workers=0)

This is my final attempt:

    [train_D, valid_D, train_L, valid_L] = train_test_split(WholeData.numpy(), WholeTargetArray, test_size=0.2, train_size=0.8, shuffle=True, stratify=WholeTargetArray)

    DatasetTrain = TensorDataset(torch.from_numpy(train_D), torch.from_numpy(train_L))
    DatasetValid = TensorDataset(torch.from_numpy(valid_D), torch.from_numpy(valid_L))

    trainloader = torch.utils.data.DataLoader(DatasetTrain, batch_size=32, shuffle=True, drop_last=True, num_workers=0)
    validationloader = torch.utils.data.DataLoader(DatasetValid, batch_size=6, drop_last=True, num_workers=0)

I think it is correct. What do you think?

This approach looks correct.

This code works very well for me, and my data is balanced. :slight_smile:

Hi Ptrblck,

Sorry, to find the best number of epochs, I use the list of validation losses from all epochs, then convert the list to a NumPy array to find the minimum loss and its index, and take the optimal epoch from there.
With the CPU it works well, but when I run the code on the GPU it gives me an error. The code is:

    val_lossesArray = np.asarray(val_losses)
    vvv2 = torch.from_numpy(val_lossesArray)
    result = np.where(val_lossesArray == np.amin(val_lossesArray))
    vv1 = result[-1]
    EpochFinal = vv1[0]
    print("best epoch", EpochFinal)


and the error is:

    result = np.where(val_lossesArray == np.amin(val_lossesArray))
    TypeError: eq() received an invalid combination of arguments - got (numpy.ndarray), but expected one of:
     * (Tensor other)
          didn't match because some of the arguments have invalid types: (!numpy.ndarray!)
     * (Number other)
          didn't match because some of the arguments have invalid types: (!numpy.ndarray!)

I tried this too, but it does not work:

    val_lossestensor = torch.from_numpy(val_lossesArray)
    VV2, Index = torch.min(val_lossestensor)

and the error is:

    File "/apps/pytorch/1.2.0-py36-cuda90/lib/python3.6/site-packages/torch/tensor.py", line 384, in __iter__
        raise TypeError('iteration over a 0-d tensor')
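
For what it's worth, a minimal sketch of a likely fix, assuming `val_losses` is a Python list of 0-d GPU tensors (names follow the post above):

    # each entry of val_losses is presumably a (GPU) tensor, which is why the
    # NumPy comparison dispatched to Tensor.eq() and failed; take .item() first
    val_lossesArray = np.asarray([loss.item() for loss in val_losses])
    best_epoch = int(np.argmin(val_lossesArray))
    print("best epoch", best_epoch)

    # alternatively, stay in PyTorch: torch.min only returns a (values, indices)
    # pair when called with an explicit dim argument
    val_lossestensor = torch.tensor(val_lossesArray)
    min_loss, best_epoch = torch.min(val_lossestensor, dim=0)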

Hi @ptrblck, I was wondering, is there any other way to do this now? A torch approach, instead of reading a dataframe, doing a train/test split, and then creating three datasets and three dataloaders for train/val/test?

Thank you in advance.
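
In case it's useful, here is a sketch of a stratified split done purely with torch ops, assuming a 1-D `targets` tensor of class labels (note that torch.utils.data.random_split itself does not stratify):

    import torch
    from torch.utils.data import Subset

    def stratified_split(targets, test_frac=0.25, generator=None):
        # targets: 1-D tensor of class labels, one per sample
        train_idx, test_idx = [], []
        for cls in torch.unique(targets):
            # indices of this class, in random order
            idx = torch.nonzero(targets == cls, as_tuple=True)[0]
            perm = idx[torch.randperm(len(idx), generator=generator)]
            n_test = int(len(perm) * test_frac)
            test_idx.append(perm[:n_test])
            train_idx.append(perm[n_test:])
        return torch.cat(train_idx), torch.cat(test_idx)

    # train_idx, test_idx = stratified_split(targets)
    # train_set = Subset(dataset, train_idx.tolist())
    # test_set = Subset(dataset, test_idx.tolist())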