How to use one class of number in MNIST

wxystudio · October 1, 2018, 2:01am

Hello I’m study the MNIST and want to train a model with only number “1”, but I don’t know how to extract the “1” class out of the total dataset… I only know the code:
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)

Thanks

ptrblck · October 1, 2018, 4:03am

You could get the indices for all class1 labels and then index the labels and data:

dataset = datasets.MNIST(root='./data')
idx = dataset.train_labels==1
dataset.train_labels = dataset.train_labels[idx]
dataset.train_data = dataset.train_data[idx]

However, your model won’t learn anything as you just have one class.
Could you explain your use case a bit?
I would at least try to keep two classes in the dataset.

Or do you want to train your model to recognize the number 1 as the valid class and all remaining numbers as false?
If so, you could try this code:

idx = dataset.train_labels != 1
dataset.train_labels[idx] = 0

wxystudio · October 1, 2018, 7:40am

Thank you! I think I actually want to train more than one class of number, it’s a typo

anima · April 18, 2019, 3:00pm

says “AttributeError: can’t set attribute.” on dataset.train_labels = dataset.train_labels[idx] line. What is wrong?

ptrblck · April 18, 2019, 3:05pm

Could you try to access now dataset.data and dataset.targets?
I think the attribute was renamed in recent torchvision versions.

anima · April 18, 2019, 3:09pm

thank you, it worked.

Anish_Batra · April 21, 2019, 11:05pm

This should work. Change the if condition in my_collate according to your needs.

def my_collate(batch):
    modified_batch = []
    for item in batch:
        image, label = item
        if label == 1:
            modified_batch.append(item)
    return default_collate(modified_batch)

train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('data', train=True, download=True, transform=transform),
    batch_size=batch_size, shuffle=True, collate_fn = my_collate)

Pranaymodukuru · July 31, 2019, 8:28am

I tried using the same on CIFAR10, but it does not work.

idx = dataset.targets ==1

This part gives only one bool value ‘true’ or ‘false’ as output, and not a list.
Is it different for different Datasets or is there is a standard way of doing it?

ptrblck · July 31, 2019, 8:34am

It is most likely different.
For CIFAR10, you could wrap the targets in a tensor and use the comparison:

dataset = datasets.CIFAR10(root='./data')
idx = torch.tensor(dataset.targets)==1

Pranaymodukuru · July 31, 2019, 8:43am

I tried this,

train_dataset = datasets.CIFAR10(root='./data')
idx = torch.tensor(train_dataset.targets) == 1
dataset.data = dataset.data[idx]
dataset.targets = dataset.targets[idx]

I am getting this error

=>              dataset.targets = dataset.targets[idx]

TypeError: only integer tensors of a single element can be converted to an index

ptrblck · July 31, 2019, 9:36am

This should work:

dataset = datasets.CIFAR10(root='./data',
                           transform=transforms.ToTensor())
dataset.targets = torch.tensor(dataset.targets)
idx = dataset.targets==1
dataset.targets= dataset.targets[idx]
dataset.data = dataset.data[idx.numpy().astype(np.bool)]

for data, target in dataset:
    print(data.sum())
    print(target)

However, I would consider these approaches as hacks and there might be some side effects I’m currently not aware of.
The clean approach would be to override the dataset class and manipulate the underlying data as you wish.

Pranaymodukuru · July 31, 2019, 11:08am

Thank you, your solution works for me.

I also found another method here:

I also included an example implementation here:

def get_indices(dataset,class_name):
    indices =  []
    for i in range(len(dataset.targets)):
        if dataset.targets[i] == class_name:
            indices.append(i)
    return indices


dataset = torchvision.datasets.CIFAR10(root='./data',
                           transform=torchvision.transforms.ToTensor())

idx = get_indices(dataset, 1)
loader = Data.DataLoader(dataset,batch_size=64, sampler = Data.sampler.SubsetRandomSampler(idx))

for idx, (data, target) in enumerate(loader):
    print(target)

I hope this works for all the datasets available in torchvision.datasets.

ptrblck · July 31, 2019, 11:10am

That looks like a cleaner approach so I would recommend to stick to the SubsetRandomSampler.

Sourena_Yadegari · October 28, 2019, 1:08pm

Creating a dataset as a subset and then define data loaders. Assuming you only want labels 1 and 2

idx = torch.tensor(CIFAR10_train.targets) == 1
idx += torch.tensor(CIFAR10_train.targets) == 2
dset_train = torch.utils.data.dataset.Subset(CIFAR100_train, np.where(idx==1)[0])

idx = torch.tensor(CIFAR10_test.targets) == 1
idx += torch.tensor(CIFAR10_test.targets) == 2
dset_test = torch.utils.data.dataset.Subset(CIFAR100_test, np.where(idx==1)[0])

and dataloaders are defined as usual:

dl_train = torch.utils.data.DataLoader(dset_train, batch_size=8, shuffle=True)
dl_test = torch.utils.data.DataLoader(dset_test, batch_size=8, shuffle=True)

Pranaymodukuru · January 8, 2020, 9:23am

This works too and seems to be simpler than SubsetRandomSampler.

ani0075 · January 14, 2020, 4:40pm

@Sourena_Yadegari
Isn’t it also necessary to reindex the data to do something useful with this dataset?

For example the CrossEntropy criterion expects a class index in the range [0, C-1].
So, in this case 1,2 have to be reindexed as 0,1 otherwise the criterion will throw an error.

Pranaymodukuru · January 14, 2020, 7:40pm

Until the number of outputs from the network/model are 2 [since you are training on 2 classes], you will not face any issues.

Also if you have 10 outputs for the network and you are only training on 2 classes, this will not have an issue too.

This condition is violated only when your data set has for example “10 classes” and your network is having “2 outputs”, but not the other way around.

iman.mirzadeh · March 31, 2020, 3:10am

I used this approach.
Here’s my implementation for split-cifar100 (a common continual learning benchmark):

def get_split_cifar100(task_id, batch_size=32, shuffle=False):
	# convention: tasks starts from 1 not 0 !
	# task_id = 1 (i.e., first task) => start_class = 0, end_class = 4
	start_class = (task_id-1)*5
	end_class = task_id * 5

	transforms = torchvision.transforms.Compose([
		torchvision.transforms.ToTensor(),
		torchvision.transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
		])

	train = torchvision.datasets.CIFAR100('./data/', train=True, download=True, transform=transforms)
	test = torchvision.datasets.CIFAR100('./data/', train=False, download=True, transform=transforms)
	
	targets_train = torch.tensor(train.targets)
	target_train_idx = ((targets_train >= start_class) & (targets_train < end_class))

	targets_test = torch.tensor(test.targets)
	target_test_idx = ((targets_test >= start_class) & (targets_test < end_class))

	train_loader = torch.utils.data.DataLoader(torch.utils.data.dataset.Subset(train, np.where(target_train_idx==1)[0]), batch_size=batch_size)
	test_loader = torch.utils.data.DataLoader(torch.utils.data.dataset.Subset(test, np.where(target_test_idx==1)[0]), batch_size=batch_size)

	return train_loader, test_loader

kawsheham · May 25, 2020, 4:33am

I want to create a sepearte dataloader ofr several subtasks. Each consisting identification of two digits? [0,1],[2,3],[4,5],[6,7],[8,9]? How can I do this?

ptrblck · May 25, 2020, 4:41am

You could use the same approach from this post and combine the conditions for the two wanted classes to create idx. The easiest approach would be to create 5 MNIST datasets and apply the conditions on each separately.

E.g. for the first DataLoader, this code should work:

dataset = datasets.MNIST(root='./data')
idx = (dataset.targets==0) | (dataset.targets==1)
dataset.targets = dataset.targets[idx]
dataset.data = dataset.data[idx]