Hi,
I am trying to create a smaller dataset from the EMNIST letters dataset by sampling x samples from each class of the dataset.
I’ve loaded the dataset using datasets.EMNIST(root='./emnist_data/', split = 'letters', train=True, transform=transform, download=True)
I have tried using the built-in sampler, but I am not sure how to pass the indices of each class for sampling.
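Try this:
import torch
from torch.utils.data import DataLoader

def _get_samples(self, n):
    # Take the first n items of the underlying dataset's data array
    samples = self.dataset.data[:n]
    if not isinstance(samples, torch.Tensor):
        samples = torch.from_numpy(samples)
    return samples

# Attach the helper to DataLoader as a method
DataLoader.get_samples = _get_samples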
Thanks! Correct me if I’m wrong, this will only sample the first n elements from the dataset, right?
Yes. Also, this code snippet adds the get_samples function to the DataLoader class.
In your case, you might not need to instantiate a DataLoader at all, so you could just do the following:
import numpy as np
from torchvision.datasets import EMNIST

N = 3000  # number of samples to draw (example value)
emnist = EMNIST(root='./emnist_data/', split='letters', train=True, transform=None, download=True).data.numpy()
# Shuffle all positions and keep the first N
indexes = np.arange(len(emnist))
np.random.shuffle(indexes)
samples = emnist[indexes[:N]]
This way you get N random samples.
Thanks a ton for that! That’s actually what I currently have: I randomly sample from the dataloader to get N samples, but I need to change it so that I get an equal number of samples from each class in the dataset.
For instance, when I randomly draw 3000 samples, they are distributed unevenly among the 26 classes of EMNIST letters. What I need is the following:
To get 2600 samples, I want 100 random samples from each class, not just 2600 samples overall.
I see, my bad. Just noticed that.
This should help, then.
import numpy as np
from torchvision.datasets import EMNIST

emnist = EMNIST(root='./emnist_data/', split='letters', train=True, transform=None, download=True)

def sample_class(class_id, N):
    # Keep only the images whose label equals class_id
    selection = emnist.data[emnist.targets == class_id]
    # Shuffle the positions and take the first N
    indexes = np.arange(len(selection))
    np.random.shuffle(indexes)
    return selection[indexes[:N]]
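If you also need the labels and the dataset's transform, one option is to pick indices per class instead of slicing emnist.data, and then wrap those indices in a Subset. Here is a minimal sketch, assuming the labels can be read as a plain list of ints (for example via emnist.targets.tolist()). The helper name balanced_indices and its seed argument are made up for illustration, and the toy labels only stand in for the real ones:

```python
import random
from collections import defaultdict

def balanced_indices(targets, n_per_class, seed=None):
    """Return dataset indices with n_per_class random picks for every label."""
    rng = random.Random(seed)
    # Group the positions of all samples by their class label
    by_class = defaultdict(list)
    for idx, label in enumerate(targets):
        by_class[int(label)].append(idx)
    picked = []
    for label in sorted(by_class):
        candidates = by_class[label]
        if len(candidates) < n_per_class:
            raise ValueError("class %d has only %d samples" % (label, len(candidates)))
        # Draw without replacement inside this class
        picked.extend(rng.sample(candidates, n_per_class))
    return picked

# Toy labels standing in for emnist.targets.tolist() (assumption for the demo)
toy_targets = [1, 2, 3] * 10
toy_idx = balanced_indices(toy_targets, n_per_class=4, seed=0)
```

With the real dataset this would look something like torch.utils.data.Subset(emnist, balanced_indices(emnist.targets.tolist(), 100)), and the resulting Subset can be passed to a DataLoader as usual.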
Thanks! I actually also figured it out. I’m doing this: How to get a part of datasets? - #5 by ptrblck
Although now I am getting an error downloading the EMNIST file and I’m not sure why. This is my error:
Downloading and extracting zip archive
Using downloaded and verified file: ./emnist_data/EMNIST/raw/emnist.zip
Extracting ./emnist_data/EMNIST/raw/emnist.zip to ./emnist_data/EMNIST/raw
Traceback (most recent call last):
  File "GAN_test.py", line 57, in <module>
    target_dataset = datasets.EMNIST(root='./emnist_data/', split = 'letters', train=True,
    transform=transform, download=True)
  File "/usr/lib/python2.7/dist-packages/torchvision/datasets/mnist.py", line 237, in __init__
    super(EMNIST, self).__init__(root, **kwargs)
  File "/usr/lib/python2.7/dist-packages/torchvision/datasets/mnist.py", line 68, in __init__
    self.download()
  File "/usr/lib/python2.7/dist-packages/torchvision/datasets/mnist.py", line 260, in download
    remove_finished=True)
  File "/usr/lib/python2.7/dist-packages/torchvision/datasets/utils.py", line 252, in download_and_extract_archive
    extract_archive(archive, extract_root, remove_finished)
  File "/usr/lib/python2.7/dist-packages/torchvision/datasets/utils.py", line 231, in extract_archive
    with zipfile.ZipFile(from_path, 'r') as z:
  File "/usr/lib/python2.7/zipfile.py", line 793, in __init__
    self._RealGetContents()
  File "/usr/lib/python2.7/zipfile.py", line 834, in _RealGetContents
    raise BadZipfile, "File is not a zip file"
zipfile.BadZipfile: File is not a zip file
Not sure why this is happening. I’ve been downloading this dataset this way for months.
Could you check the size of the emnist.zip file and also see if you can manually unzip it?
I guess the download might have failed, so you could try to re-download the file.
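Both checks can be done from Python with the standard library. This is only a rough sketch: check_archive is a made-up helper name, the path is the one from your log, and removing the file assumes that the loader will fetch it again on the next run with download=True once the archive is missing:

```python
import os
import zipfile

def check_archive(path):
    """Return (size_in_bytes, is_valid_zip) for the file at path."""
    size = os.path.getsize(path)
    return size, zipfile.is_zipfile(path)

archive = "./emnist_data/EMNIST/raw/emnist.zip"  # path taken from the log
if os.path.exists(archive):
    size, ok = check_archive(archive)
    print("size: %d bytes, valid zip: %s" % (size, ok))
    if not ok:
        # Drop the broken file so that it is not reused on the next run
        os.remove(archive)
```

A very small file, or one that zipfile.is_zipfile rejects, usually points to an interrupted download or to an error page saved in place of the archive.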
I’ve manually unzipped the file after downloading it from Kaggle, but how do I use only the ‘letters’ portion? Do I just pass the train_letters.pt file in my code?
Would you like to share your full solution here, so it is ready to use in case someone else needs something similar?
This is how I’m sampling equally from each class of the dataset:
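import numpy as np
import torch
from sklearn.model_selection import train_test_split

def _create_samples(dataset, num_classes, k_samp):
    # k_samp is the total number of samples I need; N is the number per class
    N = int(np.ceil(k_samp / float(num_classes)))
    indices = np.arange(len(dataset))
    # Stratified split: the first part holds N * num_classes indices, drawn in proportion to the labels
    train_indices, test_indices = train_test_split(indices, train_size=N * num_classes, stratify=dataset.targets)
    # Wrap the selected indices into a Subset
    train_dataset = torch.utils.data.Subset(dataset, train_indices)
    return train_dataset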
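Note that a stratified split keeps the class proportions of the full dataset, so the counts per class come out equal only when the classes are balanced to begin with. A quick way to check the result is to count the labels of the selected indices. This is a small sketch with a made-up helper name (class_counts) and toy data in place of the real labels:

```python
from collections import Counter

def class_counts(targets, indices):
    """Count how many of the selected indices fall into each class label."""
    return Counter(int(targets[i]) for i in indices)

# Toy example: three classes, six indices picked by hand
toy_targets = [1, 1, 2, 2, 3, 3, 1, 2, 3]
toy_indices = [0, 1, 2, 3, 4, 5]
counts = class_counts(toy_targets, toy_indices)
print(counts)
```

For the Subset returned above, the call would be along the lines of class_counts(dataset.targets, train_dataset.indices), where indices is the attribute in which Subset stores the positions it was given.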