How can I sample from the EMNIST letters dataset?

avalon1511 · May 25, 2021, 8:38pm

Hi,

I am trying to create a smaller dataset from the EMNIST letters dataset by sampling x samples from each class of the dataset.

I’ve loaded the dataset using datasets.EMNIST(root='./emnist_data/', split = 'letters', train=True, transform=transform, download=True)

I have tried using the built-in sampler, but I am not sure how to pass the indices of each class for sampling.

eduardo4jesus · May 25, 2021, 8:42pm

Try this

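import torch
from torch.utils.data import DataLoader

def _get_samples(self, n):
  # Take the first n raw samples from the wrapped dataset
  samples = self.dataset.data[:n]
  if not isinstance(samples, torch.Tensor):
    samples = torch.from_numpy(samples)
  return samples

# Attach the helper to DataLoader as a method
DataLoader.get_samples = _get_samples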

avalon1511 · May 25, 2021, 8:53pm

Thanks! Correct me if I’m wrong, this will only sample the first n elements from the dataset right?

eduardo4jesus · May 25, 2021, 8:58pm

Yes. Also, this code snippet is adding the get_samples function to the DataLoader.

In your case, you might not need to instantiate a DataLoader, so you could just do the following:

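import numpy as np
from torchvision.datasets import EMNIST

N = 1000  # number of samples to draw (example value)

# Raw image data as a NumPy array (no transform applied)
emnist = EMNIST(root='./emnist_data/', split='letters', train=True, transform=None, download=True).data.numpy()

# Shuffle all indices and keep the first N
indexes = np.arange(len(emnist))
np.random.shuffle(indexes)

samples = emnist[indexes[:N]]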

This way you get N random samples.

avalon1511 · May 25, 2021, 9:02pm

Thanks a ton for that! That’s actually what I currently have: I randomly sample from the dataloader to get N samples, but I need to change it so that I get an equal number of samples from each class in the dataset.

For instance, when I randomly draw, say, 3000 samples, they are distributed unevenly among the 26 classes of EMNIST letters. What I need is the following:

To draw, say, 2600 samples, I want 100 random samples from each class, not just 2600 samples overall.

eduardo4jesus · May 25, 2021, 9:06pm

I see, my bad. Just noticed that.

This should help, then.

import numpy as np
from torchvision.datasets import EMNIST

emnist = EMNIST(root='./emnist_data/', split='letters', train=True, transform=None, download=True)

def sample_class(class_id, N):
  # Keep only the images whose label equals class_id
  selection = emnist.data[emnist.targets == class_id]
  # Shuffle positions within that class and take the first N
  indexes = np.arange(len(selection))
  np.random.shuffle(indexes)

  return selection[indexes[:N]]
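
If you need dataset indices rather than the raw images (for example, to keep the transform and labels), a rough sketch along the same lines could look like the following. The helper name `balanced_indices` and its `seed` parameter are made up for illustration, and the labels are a synthetic stand-in for `emnist.targets`, so treat this as a sketch under those assumptions rather than a tested recipe for EMNIST itself.

```python
import numpy as np

def balanced_indices(targets, n_per_class, seed=None):
    """Return shuffled dataset indices with n_per_class entries for every class.

    Hypothetical helper: `targets` is any 1-D array-like of integer labels.
    """
    targets = np.asarray(targets)
    rng = np.random.default_rng(seed)
    picked = []
    for class_id in np.unique(targets):
        # Positions in the full dataset that belong to this class
        class_idx = np.flatnonzero(targets == class_id)
        if len(class_idx) < n_per_class:
            raise ValueError(f"class {class_id} has only {len(class_idx)} samples")
        # Draw without replacement so no index is repeated
        picked.append(rng.choice(class_idx, size=n_per_class, replace=False))
    indices = np.concatenate(picked)
    rng.shuffle(indices)  # mix the classes together
    return indices

# Synthetic, deliberately uneven labels for 26 classes (not real EMNIST data)
fake_targets = np.concatenate([np.full(150 + 10 * c, c) for c in range(1, 27)])
idx = balanced_indices(fake_targets, n_per_class=100, seed=0)
counts = np.bincount(fake_targets[idx], minlength=27)[1:]
```

With the real dataset, the resulting indices could then be passed to `torch.utils.data.Subset(emnist, idx.tolist())`.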

avalon1511 · May 25, 2021, 11:51pm

Thanks! I actually also figured it out. I’m doing this: How to get a part of datasets? - #5 by ptrblck

Now, though, I am getting an error when downloading the EMNIST file and I’m not sure why. This is my error:

Downloading and extracting zip archive
Using downloaded and verified file: ./emnist_data/EMNIST/raw/emnist.zip
Extracting ./emnist_data/EMNIST/raw/emnist.zip to ./emnist_data/EMNIST/raw
Traceback (most recent call last):
  File "GAN_test.py", line 57, in <module>
    target_dataset = datasets.EMNIST(root='./emnist_data/', split = 'letters', train=True, transform=transform, download=True)
  File "/usr/lib/python2.7/dist-packages/torchvision/datasets/mnist.py", line 237, in __init__
    super(EMNIST, self).__init__(root, **kwargs)
  File "/usr/lib/python2.7/dist-packages/torchvision/datasets/mnist.py", line 68, in __init__
    self.download()
  File "/usr/lib/python2.7/dist-packages/torchvision/datasets/mnist.py", line 260, in download
    remove_finished=True)
  File "/usr/lib/python2.7/dist-packages/torchvision/datasets/utils.py", line 252, in download_and_extract_archive
    extract_archive(archive, extract_root, remove_finished)
  File "/usr/lib/python2.7/dist-packages/torchvision/datasets/utils.py", line 231, in extract_archive
    with zipfile.ZipFile(from_path, 'r') as z:
  File "/usr/lib/python2.7/zipfile.py", line 793, in __init__
    self._RealGetContents()
  File "/usr/lib/python2.7/zipfile.py", line 834, in _RealGetContents
    raise BadZipfile, "File is not a zip file"
zipfile.BadZipfile: File is not a zip file

Not sure why this is happening; I’ve been downloading this dataset this way for months.

ptrblck · May 26, 2021, 12:00am

Could you check the size of the emnist.zip file and see if you can unzip it manually?
I guess the download might have failed, so you could try to re-download the file.
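
Both checks can also be done from Python with the standard library. The sketch below uses a hypothetical helper, `describe_archive`, and demonstrates it on two temporary files (one real zip, one fake) so it runs without the dataset; for the actual file you would pass the path from the log above, `./emnist_data/EMNIST/raw/emnist.zip`.

```python
import os
import tempfile
import zipfile

def describe_archive(path):
    """Return (size_in_bytes, looks_like_zip) for the file at path."""
    return os.path.getsize(path), zipfile.is_zipfile(path)

with tempfile.TemporaryDirectory() as tmp:
    # A valid archive containing one small text file
    good = os.path.join(tmp, "good.zip")
    with zipfile.ZipFile(good, "w") as z:
        z.writestr("hello.txt", "hello")

    # A file with a .zip name that is not an archive,
    # similar to what a failed download could leave behind
    bad = os.path.join(tmp, "bad.zip")
    with open(bad, "wb") as f:
        f.write(b"<html>not an archive</html>")

    good_size, good_ok = describe_archive(good)
    bad_size, bad_ok = describe_archive(bad)

print("good.zip:", good_size, "bytes, zip:", good_ok)
print("bad.zip: ", bad_size, "bytes, zip:", bad_ok)
```

If the check reports that emnist.zip is not a zip file, deleting it and re-running with `download=True` should trigger a fresh download.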

avalon1511 · May 26, 2021, 12:08am

I’ve manually unzipped the file after downloading it from Kaggle, but how do I use only the ‘letters’ portion? Do I just pass the train_letters.pt file in my code?

eduardo4jesus · May 26, 2021, 8:57pm

Would you like to share your full solution here, so it is ready to use in case someone else needs something similar?

avalon1511 · June 1, 2021, 5:40pm

This is how I’m sampling equally from each class of the dataset

import numpy as np
import torch
from sklearn.model_selection import train_test_split

def _create_samples(dataset, num_classes):
    # k_samp (defined elsewhere) is the total number of samples I need
    N = int(np.ceil(k_samp / num_classes))
    indices = np.arange(len(dataset))
    # Stratified split; only the train part is kept, the rest is discarded
    train_indices, _ = train_test_split(indices, train_size=N * num_classes, stratify=dataset.targets)

    # Wrap the selected indices into a Subset of the dataset passed in
    train_dataset = torch.utils.data.Subset(dataset, train_indices)

    return train_dataset
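
One caveat with the stratified approach: `stratify` keeps the class proportions of the full dataset rather than forcing equal counts, so the subset is only balanced when the source split is balanced. A small sketch on synthetic labels (the 300/100 split and the variable names are made up for illustration) shows how to check the per-class counts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 300 samples of class 0, 100 of class 1
targets = np.array([0] * 300 + [1] * 100)
indices = np.arange(len(targets))

# Same call pattern as above: keep only the stratified "train" part
train_idx, _ = train_test_split(
    indices, train_size=40, stratify=targets, random_state=0
)

# Count how many selected samples fall into each class
counts = np.bincount(targets[train_idx], minlength=2)
print(counts)  # the 3:1 class ratio of the source is preserved
```

Running the same `np.bincount` check on `dataset.targets[train_indices]` would confirm whether the EMNIST subset came out evenly distributed.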