Load MNIST: how to get the labels?

It may sound strange, but I would like to load the MNIST labels using torchvision.datasets.MNIST. I loaded the images like this:

import torch
import torchvision

train_loader = torch.utils.data.DataLoader(
  torchvision.datasets.MNIST('/data/mnist', train=True, download=True,
                             transform=torchvision.transforms.Compose([
                               torchvision.transforms.ToTensor(),
                               torchvision.transforms.Normalize(
                                 (0.1307,), (0.3081,))
                             ])),
  batch_size=16, shuffle=False)

If I print the dataset I get this, but where are the labels?

Dataset MNIST
Number of datapoints: 60000
Split: train
Root Location: /data/mnist
Transforms (if any): Compose(
ToTensor()
Normalize(mean=(0.1307,), std=(0.3081,))
)
Target Transforms (if any): None

What do you suggest? The origin is http://yann.lecun.com/exdb/mnist/


You can print the labels using dataset.targets.
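For example (a minimal sketch, reusing the /data/mnist path from the snippet above):

import torchvision

dataset = torchvision.datasets.MNIST('/data/mnist', train=True, download=True)

# in recent torchvision versions the labels are stored in the targets attribute
print(dataset.targets.shape)   # torch.Size([60000])
print(dataset.targets[:10])    # tensor([5, 0, 4, 1, 9, 2, 1, 3, 1, 4])

# the same attribute is reachable through the DataLoader
print(train_loader.dataset.targets[:10])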


Happens to be that easy.

If I try to split this train=True dataset into a training and a validation part (60000 = 50000 + 10000), would that be easy, or should I use train=False to load the other dataset (the test dataset) and use that as my validation set?

I would suggest wrapping your training Dataset in a Subset and passing it the training and validation indices, while train=False would create the test dataset.
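A minimal sketch of that split (the shuffling and the 50000/10000 sizes are just an example):

import torch
import torchvision

dataset = torchvision.datasets.MNIST('/data/mnist', train=True, download=True,
                                     transform=torchvision.transforms.ToTensor())

# shuffle once, then split the 60000 training samples into 50000 + 10000
indices = torch.randperm(len(dataset)).tolist()
train_ds = torch.utils.data.Subset(dataset, indices[:50000])
valid_ds = torch.utils.data.Subset(dataset, indices[50000:])

# train=False loads the separate 10000-image test set
test_ds = torchvision.datasets.MNIST('/data/mnist', train=False, download=True,
                                     transform=torchvision.transforms.ToTensor())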

OK, so the indices alone decide what is training and what is validation, and the test set stays the test set. I may not use that for my blitz experiments. Good.

I am using the subsets now, but their datasets point to the original set:

test_ds = torch.utils.data.Subset(train_loader.dataset, (0, 50000-1))
valid_ds = torch.utils.data.Subset(train_loader.dataset, (50000, 60000-1))
print(test_ds.indices) #(0, 49999)
print(valid_ds.indices) #(50000, 59999)
print(test_ds.dataset.targets[0]) #tensor(5)
print(valid_ds.dataset.targets[0]) #tensor(5)

#SequentialSampler ???

Should I cast them to a set, or use a sampler… no idea.

Subset is just a thin wrapper that uses the passed indices to index the underlying Dataset without actually manipulating it.
Have a look at this line of code.
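Roughly, it behaves like this simplified sketch (not the exact source, but the same idea):

import torch

class Subset(torch.utils.data.Dataset):
    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = indices

    def __getitem__(self, idx):
        # translate the subset index into an index of the underlying dataset
        return self.dataset[self.indices[idx]]

    def __len__(self):
        return len(self.indices)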

You’ll get the right samples if you index the Subset directly:

x, y = test_ds[0]

Helpful:

import torch
import torch.nn as nn
import torchvision.models as models
import torchvision

train_loader = torch.utils.data.DataLoader(
  torchvision.datasets.MNIST('/data/mnist', train=True, download=True,
                             transform=torchvision.transforms.Compose([
                               torchvision.transforms.ToTensor(),
                               torchvision.transforms.Normalize(
                                (0.1307,), (0.3081,))
                             ])),
  batch_size=16, shuffle=False)
test_ds  = torch.utils.data.Subset(train_loader.dataset, range(0, 50000))      # first 50000 samples
valid_ds = torch.utils.data.Subset(train_loader.dataset, range(50000, 60000))  # last 10000 samples
for i in range(0, 10):
    ty = test_ds[i][1]
    vy = valid_ds[i][1]
    print(ty, vy)

out:

5 3
0 8
4 6
1 9
9 6
2 4
1 5
3 3
1 8
4 4

I have a subset of more than 30000 images and I am trying this code, but it is taking forever.

labels = []
for _, data in train_set:          # iterate the whole dataset just to collect the targets
    labels.append(data['labels'])
print(labels)

Is there any faster way of doing this?

If each target is lazily created, you would have to iterate the Dataset or DataLoader at least once.
Depending on how train_set is defined, the targets might have been loaded completely in its __init__, so you could check that. Also, a DataLoader should speed up the loop if you increase the batch size and use multiple workers.
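For example, something along these lines (a sketch assuming each sample is a plain (image, label) pair as with MNIST; the batch size and worker count are arbitrary):

import torch

loader = torch.utils.data.DataLoader(train_set, batch_size=256, num_workers=4)

labels = []
for _, target in loader:
    labels.append(target)   # target is a whole batch of labels
labels = torch.cat(labels)
print(labels.shape)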

How long does this loop take?

It was taking more than 30 minutes, and even running the loop only 100 times takes a lot of time. I think this is because a tensor is created at each index.

I’m not sure if you are using MNIST as in the original question, but this code takes ~1 second on my laptop:

from torchvision import datasets

dataset = datasets.MNIST(root=PATH)

labels = []
for _, label in dataset:
    labels.append(label)

I am loading different data from multiple folders with images and annotations (20000 images).

tensor_data, target = dataset[0]  # target == {'boxes': [...], 'labels': 'a'}
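If the samples really look like that, one way to speed up collecting just the labels might be a DataLoader with multiple workers and a collate_fn that keeps only the 'labels' entry (a sketch under that assumption; the images are still decoded by the dataset, but the loading is parallelized across workers):

import torch

def label_collate(batch):
    # keep only the 'labels' entry of each target dict, drop images and boxes
    return [target['labels'] for _, target in batch]

loader = torch.utils.data.DataLoader(dataset, batch_size=64, num_workers=4,
                                     collate_fn=label_collate)

labels = []
for batch_labels in loader:
    labels.extend(batch_labels)
print(len(labels))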