How to split data into train and test sets while keeping equal proportions of each class?

Suppose I have a dataset with the following classes:

Class A: 3000 items
Class B: 1000 items
Class C: 2000 items

I want to split this dataset into two parts so that 25% of the data goes into the test set. How can I do this so that an equal percentage of each class is present in the test set? The items should be randomly selected. For example, the test set should look like the following:

Class A: 750 items
Class B: 250 items
Class C: 500 items

Make a list for each class, take 25% at random from each list, combine the lists and shuffle.
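
A minimal sketch of that recipe in NumPy (the `labels` array, the 25% ratio, and the function name are just for illustration):

    import numpy as np

    def stratified_indices(labels, test_frac=0.25, seed=0):
        # collect the indices of each class, take a fixed fraction from
        # each at random, then shuffle the combined train/test index lists
        rng = np.random.default_rng(seed)
        train_idx, test_idx = [], []
        for cls in np.unique(labels):
            idx = rng.permutation(np.where(labels == cls)[0])
            n_test = int(len(idx) * test_frac)
            test_idx.extend(idx[:n_test])
            train_idx.extend(idx[n_test:])
        rng.shuffle(train_idx)
        rng.shuffle(test_idx)
        return train_idx, test_idx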

Is there any PyTorch code for this, or do we need to use some other library?

I am asking for PyTorch code for performance reasons, in case of large datasets.

You could keep lists of pairs of filenames and labels and prepare batches asynchronously in a background worker thread. That should be efficient even when training with millions of images. You might want to take a look at the torch.utils.data.dataset.Dataset and torch.utils.data.sampler.Sampler classes, which you can use in conjunction with the torch.utils.data.DataLoader.
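
A minimal sketch of that lazy-loading pattern (the `LazyImageDataset` name, the `samples` list of pairs, and loading via PIL are assumptions for illustration):

    import torch
    from torch.utils.data import Dataset, DataLoader
    from PIL import Image

    class LazyImageDataset(Dataset):
        """Stores only (filename, label) pairs; images are read on demand."""
        def __init__(self, samples, transform=None):
            self.samples = samples          # list of (path, label) tuples
            self.transform = transform

        def __len__(self):
            return len(self.samples)

        def __getitem__(self, index):
            path, label = self.samples[index]
            image = Image.open(path).convert('RGB')  # disk I/O happens here
            if self.transform is not None:
                image = self.transform(image)
            return image, label

    # The DataLoader's worker processes then do the image loading in the background:
    # loader = DataLoader(LazyImageDataset(samples), batch_size=32, num_workers=4)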

If you load the dataset completely before passing it to the Dataset and DataLoader classes, you could use scikit-learn’s train_test_split with the stratify option.

In that case, will it be possible to use something like num_workers while loading?

This would split the dataset before using any of the PyTorch classes.
You would get different splits and create different Dataset classes:


import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 2)
y = np.random.randint(0, 10, size=1000)

# stratify=y keeps the class proportions equal in both splits
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, stratify=y)

# check the per-class counts in each split
np.unique(y_train, return_counts=True)
np.unique(y_val, return_counts=True)

# wrap the splits in your own Dataset and DataLoader as usual
train_dataset = Dataset(X_train, y_train, ...)
train_loader = DataLoader(train_dataset, ...)

The DataLoader is thus completely unaffected by this and you can use num_workers as you wish.
If you need lazy loading, @Pfaeff had a good approach. :wink:

Understood, thanks a lot.

I also have the same question. How can I solve this issue?

Is the mentioned train_test_split not working for your use case?

I have image data. I want to split it in such a way that the class proportions are maintained.

Are you able to get all the targets without loading the actual data?
If so, you could use train_test_split passing the indices and targets, and use the index splits for your Subsets.
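
A minimal sketch of that index-based approach (`dataset` and `targets` are placeholders for your own objects):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from torch.utils.data import Subset

    # targets holds one label per sample and can be built without loading images
    indices = np.arange(len(targets))
    train_idx, val_idx = train_test_split(indices, test_size=0.25, stratify=targets)

    # Subset just remembers the indices; the underlying dataset stays lazy
    train_dataset = Subset(dataset, train_idx)
    val_dataset = Subset(dataset, val_idx)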

The problem is solved. Thanks!

Dear Ptrblck,

I created the split data for validation and training, and now I want to pass it to the data loader. I am confused about how to prepare my data for the data loader; in particular, I don’t know how to create DatasetTrain and DatasetValid. I tried this:

    [train_D, valid_D, train_L, valid_L] = train_test_split(WholeData.numpy(), WholeTargetArray, test_size=0.2, train_size=0.8, shuffle=True, stratify=WholeTargetArray)
    DatasetTrain = Dataset(train_D, train_L)  # ???????
    DatasetValid = Dataset(valid_D, valid_L)  # ?????
    trainloader = torch.utils.data.DataLoader(DatasetTrain, batch_size=32, shuffle=True, drop_last=True, num_workers=0)
    validationloader = torch.utils.data.DataLoader(DatasetValid, batch_size=6, drop_last=True, num_workers=0)

This is my final attempt:

    [train_D, valid_D, train_L, valid_L] = train_test_split(WholeData.numpy(), WholeTargetArray, test_size=0.2, train_size=0.8, shuffle=True, stratify=WholeTargetArray)

    DatasetTrain = TensorDataset(torch.from_numpy(train_D), torch.from_numpy(train_L))
    DatasetValid = TensorDataset(torch.from_numpy(valid_D), torch.from_numpy(valid_L))

    trainloader = torch.utils.data.DataLoader(DatasetTrain, batch_size=32, shuffle=True, drop_last=True, num_workers=0)
    validationloader = torch.utils.data.DataLoader(DatasetValid, batch_size=6, drop_last=True, num_workers=0)

I think it is correct. What do you think?

This approach looks correct.

This code works very well for me, and my data is balanced. :slight_smile:

Hi Ptrblck,

Sorry, to find the best number of epochs, I use the list of validation losses from all epochs, then convert the list to a NumPy array to find the minimum loss and its index, and take the optimal epoch from there.
With the CPU it works well, but when I run the code on the GPU it gives me an error. The code is:

    val_lossesArray = np.asarray(val_losses)
    vvv2 = torch.from_numpy(val_lossesArray)
    result = np.where(val_lossesArray == np.amin(val_lossesArray))
    vv1 = result[-1]
    EpochFinal = vv1[0]
    print("best epoch", EpochFinal)


and the error is:

    result = np.where(val_lossesArray == np.amin(val_lossesArray))
    TypeError: eq() received an invalid combination of arguments - got (numpy.ndarray), but expected one of:
     * (Tensor other)
          didn't match because some of the arguments have invalid types: (!numpy.ndarray!)
     * (Number other)
          didn't match because some of the arguments have invalid types: (!numpy.ndarray!)

I tried this too, but it does not work:

    val_lossestensor = torch.from_numpy(val_lossesArray)
    VV2, Index = torch.min(val_lossestensor)

and the error is:

    File "/apps/pytorch/1.2.0-py36-cuda90/lib/python3.6/site-packages/torch/tensor.py", line 384, in __iter__
        raise TypeError('iteration over a 0-d tensor')
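
For what it's worth, a minimal sketch of a likely fix, assuming `val_losses` is a Python list of 0-d GPU tensors (names follow the post above):

    # each entry of val_losses is presumably a (GPU) tensor, which is why the
    # NumPy comparison dispatched to Tensor.eq() and failed; take .item() first
    val_lossesArray = np.asarray([loss.item() for loss in val_losses])
    best_epoch = int(np.argmin(val_lossesArray))
    print("best epoch", best_epoch)

    # alternatively, stay in PyTorch: torch.min only returns a (values, indices)
    # pair when called with an explicit dim argument
    val_lossestensor = torch.tensor(val_lossesArray)
    min_loss, best_epoch = torch.min(val_lossestensor, dim=0)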

Hi @ptrblck, I was wondering, is there any other way to do this now? A torch approach, instead of reading a dataframe, doing a train/test split, and then creating three datasets and three dataloaders for train/val/test?

Thank you in advance.
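
In case it's useful, here is a sketch of a stratified split done purely with torch ops, assuming a 1-D `targets` tensor of class labels (note that torch.utils.data.random_split itself does not stratify):

    import torch
    from torch.utils.data import Subset

    def stratified_split(targets, test_frac=0.25, generator=None):
        # targets: 1-D tensor of class labels, one per sample
        train_idx, test_idx = [], []
        for cls in torch.unique(targets):
            # indices of this class, in random order
            idx = torch.nonzero(targets == cls, as_tuple=True)[0]
            perm = idx[torch.randperm(len(idx), generator=generator)]
            n_test = int(len(perm) * test_frac)
            test_idx.append(perm[:n_test])
            train_idx.append(perm[n_test:])
        return torch.cat(train_idx), torch.cat(test_idx)

    # train_idx, test_idx = stratified_split(targets)
    # train_set = Subset(dataset, train_idx.tolist())
    # test_set = Subset(dataset, test_idx.tolist())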