How can I sample from the EMNIST letters dataset?

Hi,

I am trying to create a smaller dataset from the EMNIST letters dataset by sampling x samples from each class of the dataset.

I’ve loaded the dataset using datasets.EMNIST(root='./emnist_data/', split = 'letters', train=True, transform=transform, download=True)

I have tried using the built-in sampler, but I am not sure how to pass the indices of each class for sampling.

Try this

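import torch
from torch.utils.data import DataLoader

def _get_samples(self, n):
  # Take the first n items from the underlying dataset's data attribute
  samples = self.dataset.data[:n]
  if not isinstance(samples, torch.Tensor):
    samples = torch.from_numpy(samples)
  return samples

# Attach the helper to DataLoader as a method
DataLoader.get_samples = _get_samples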

Thanks! Correct me if I’m wrong, but this will only take the first n elements from the dataset, right?

Yes. Also, this code snippet adds the get_samples function to the DataLoader class.

In your case, you might not need to instantiate a DataLoader, so you could just do the following:

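import numpy as np
from torchvision.datasets import EMNIST

N = 1000  # number of random samples to draw (example value)

# .data holds the raw images as a tensor; convert it to a NumPy array
emnist = EMNIST(root='./emnist_data/', split='letters', train=True, transform=None, download=True).data.numpy()

# Shuffle all indices and keep the first N
indexes = np.arange(len(emnist))
np.random.shuffle(indexes)

samples = emnist[indexes[:N]]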

This way you get N random samples.

Thanks a ton for that! That’s actually what I currently have: I randomly sample from the dataloader to get N samples. But I need to change it so that I have an equal number of samples from each class in the dataset.

For instance, when I randomly draw, say, 3000 samples, they are distributed unevenly among the 26 classes of EMNIST letters. What I need is the following:

To get, say, 2600 samples, I want 100 random samples from each class, not just 2600 samples overall.

I see, my bad. Just noticed that.

This should help, then.

import numpy as np
from torchvision.datasets import EMNIST

emnist = EMNIST(root='./emnist_data/', split='letters', train=True, transform=None, download=True)

def sample_class(class_id, N):
  # Keep only the images whose label equals class_id
  selection = emnist.data[emnist.targets == class_id]
  # Shuffle positions within that class and take the first N
  indexes = np.arange(len(selection))
  np.random.shuffle(indexes)

  return selection[indexes[:N]]
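To go from one class to a balanced subset, the same per-class selection can be repeated for every label and the picked indices concatenated. A minimal sketch, assuming the labels are available as a 1-D integer array (for EMNIST that would presumably be emnist.targets.numpy()); balanced_indices and its parameters are hypothetical names, not part of torchvision, and the toy labels only stand in for the real targets:

```python
import numpy as np

def balanced_indices(targets, n_per_class, seed=0):
    """Return dataset indices with n_per_class random picks for every label."""
    targets = np.asarray(targets)
    rng = np.random.default_rng(seed)
    picked = []
    for class_id in np.unique(targets):
        # Positions in the full dataset that belong to this class
        class_idx = np.flatnonzero(targets == class_id)
        if len(class_idx) < n_per_class:
            raise ValueError(
                "class {} has only {} samples".format(class_id, len(class_idx))
            )
        # Draw without replacement so no sample is repeated
        picked.append(rng.choice(class_idx, size=n_per_class, replace=False))
    return np.concatenate(picked)

# Toy labels standing in for emnist.targets: 3 classes, 50 samples each
toy_targets = np.repeat([1, 2, 3], 50)
idx = balanced_indices(toy_targets, n_per_class=10)
```

Because this returns indices rather than image tensors, they can be handed to torch.utils.data.Subset(emnist, idx.tolist()), which keeps the dataset's transform in play, or to torch.utils.data.SubsetRandomSampler.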

Thanks! I actually also figured it out. I’m doing this: How to get a part of datasets? - #5 by ptrblck

Although now I am getting an error downloading the EMNIST file and I’m not sure why. This is my error:

Downloading and extracting zip archive
Using downloaded and verified file: ./emnist_data/EMNIST/raw/emnist.zip
Extracting ./emnist_data/EMNIST/raw/emnist.zip to ./emnist_data/EMNIST/raw
Traceback (most recent call last):
  File "GAN_test.py", line 57, in <module>
    target_dataset = datasets.EMNIST(root='./emnist_data/', split = 'letters', train=True, transform=transform, download=True)
  File "/usr/lib/python2.7/dist-packages/torchvision/datasets/mnist.py", line 237, in __init__
    super(EMNIST, self).__init__(root, **kwargs)
  File "/usr/lib/python2.7/dist-packages/torchvision/datasets/mnist.py", line 68, in __init__
    self.download()
  File "/usr/lib/python2.7/dist-packages/torchvision/datasets/mnist.py", line 260, in download
    remove_finished=True)
  File "/usr/lib/python2.7/dist-packages/torchvision/datasets/utils.py", line 252, in download_and_extract_archive
    extract_archive(archive, extract_root, remove_finished)
  File "/usr/lib/python2.7/dist-packages/torchvision/datasets/utils.py", line 231, in extract_archive
    with zipfile.ZipFile(from_path, 'r') as z:
  File "/usr/lib/python2.7/zipfile.py", line 793, in __init__
    self._RealGetContents()
  File "/usr/lib/python2.7/zipfile.py", line 834, in _RealGetContents
    raise BadZipfile, "File is not a zip file"
zipfile.BadZipfile: File is not a zip file

Not sure why this is happening. I’ve been downloading this dataset this way for months.

Could you check the size of the emnist.zip file and also see whether you can unzip it manually?
I guess the download might have failed, so you could try to re-download the file.
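Both checks can also be done from Python with the standard library. A small sketch, assuming the archive sits at the path shown in the log above; check_archive is a hypothetical helper name:

```python
import os
import zipfile

def check_archive(path):
    """Report whether path exists, its size in bytes, and whether it looks like a zip."""
    if not os.path.isfile(path):
        return {"exists": False, "size": 0, "is_zip": False}
    return {
        "exists": True,
        "size": os.path.getsize(path),
        # is_zipfile only inspects the zip signature; it does not extract anything
        "is_zip": zipfile.is_zipfile(path),
    }

# Path taken from the download log; adjust if your root differs
print(check_archive("./emnist_data/EMNIST/raw/emnist.zip"))
```

If is_zip comes back False or the size is suspiciously small, removing the file so that the next run with download=True fetches a fresh copy is one thing to try.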

I’ve manually unzipped the file after downloading it from Kaggle, but how do I use only the ‘letters’ portion? Do I just pass the train_letters.pt file in my code?

Would you like to share your full solution here, so it is ready to use in case someone else needs something similar?

This is how I’m sampling equally from each class of the dataset

import numpy as np
import torch
from sklearn.model_selection import train_test_split

def _create_samples(dataset, num_classes, k_samp):
        # k_samp is the total number of samples needed; round up to N per class
        N = int(np.ceil(k_samp / float(num_classes)))
        indices = np.arange(len(dataset))
        # Stratified split: the "train" side holds N * num_classes indices
        train_indices, test_indices = train_test_split(indices, train_size = N * num_classes, stratify = dataset.targets)

        # Wrap the selected indices into a Subset
        train_dataset = torch.utils.data.Subset(dataset, train_indices)

        return train_dataset
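One caveat: stratify keeps the class proportions of the full dataset, so the per-class counts come out equal only when the source classes are themselves balanced. A quick sanity check, sketched with toy labels and indices in place of dataset.targets and train_indices; class_counts is a hypothetical helper:

```python
from collections import Counter

def class_counts(targets, indices):
    """Count how many of the selected indices fall into each class."""
    return Counter(int(targets[i]) for i in indices)

# Toy stand-ins for dataset.targets and train_indices
toy_targets = [1, 1, 1, 2, 2, 2, 3, 3, 3]
toy_indices = [0, 2, 3, 4, 7, 8]

counts = class_counts(toy_targets, toy_indices)
print(counts)
```

With the real data, the same call would take dataset.targets and the train_indices chosen inside the function, and every class should report the same count.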