Stratified split of data set


(Hrishikesh Menon) #1

Hi,

I know that most people prefer to create separate data sets for training and testing. However, can we perform a stratified split on a data set? By ‘stratified split’, I mean that for a 70:30 split, each class is split 70:30 individually; the 70% portions of all classes are then merged to form data set 1 and the 30% portions to form data set 2. While I have seen random splits (like kevinzakka’s script), I have not seen an example of a stratified split yet.
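For reference, the per-class splitting described above can be sketched in plain Python. This is a minimal sketch, not an official utility; the 70:30 ratio and the two-class label list are just example values:

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac=0.7, seed=0):
    """Split indices so each class is divided in the given ratio.

    Returns two lists of dataset indices: one for the training
    portion and one for the test portion.
    """
    rng = random.Random(seed)
    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    # Split each class separately, then merge the pieces.
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = int(round(train_frac * len(indices)))
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return train_idx, test_idx

# Example: 10 samples of class 0 and 10 of class 1.
labels = [0] * 10 + [1] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 14 6
```

With a torchvision dataset you can pass the resulting index lists to `torch.utils.data.Subset`. Alternatively, scikit-learn’s `train_test_split` accepts a `stratify=labels` argument that performs the same kind of split.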

Continuing this, is there a way to access all elements of a single class once the data has been dumped to a data set?

Thank you,
Richukuttan


(Solomon K ) #2

Look here:


Section named “train validation split”


(Hrishikesh Menon) #3

Please explain how this ensures a proportional split of the examples of each class. From what I understand, the only way that is possible is by carefully arranging the entries in the CSV file. If I randomize the entries, the output also becomes randomized, so it is entirely possible that a particular class gets no training examples because all of its examples lie after the split point. If it really depends on the CSV ordering, it may be much easier to just create two folders.

Also, can you explain how to access all elements of a given class after the data has been loaded into a dataset (preferably with ImageFolder)?


#4

This depends on the dataset; look at the source code to figure it out: https://github.com/pytorch/vision/blob/master/torchvision/datasets/folder.py
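For ImageFolder specifically, the source linked above shows that the dataset exposes `samples` (a list of `(path, class_index)` pairs), `targets` (the class index of each sample), and `class_to_idx` (a dict mapping folder name to class index). Filtering indices by class can be sketched like this; the file paths and class names below are made-up stand-ins for a real dataset:

```python
# Stand-in for what ImageFolder builds from a folder tree:
# dataset.samples, dataset.targets, dataset.class_to_idx.
samples = [("cat/1.png", 0), ("dog/1.png", 1), ("cat/2.png", 0)]
class_to_idx = {"cat": 0, "dog": 1}

def indices_of_class(targets, class_index):
    """Return the dataset indices whose label equals class_index."""
    return [i for i, t in enumerate(targets) if t == class_index]

targets = [label for _, label in samples]
cat_indices = indices_of_class(targets, class_to_idx["cat"])
print(cat_indices)  # [0, 2]
```

With a real ImageFolder instance, the same index list can then be wrapped with `torch.utils.data.Subset(dataset, cat_indices)` to get a dataset containing only that class.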