Loading MNIST: how do I get the labels?

This may be a strange question, but I would like to load the MNIST labels using torchvision.datasets.MNIST. I loaded the images like this:

train_loader = torch.utils.data.DataLoader(
  torchvision.datasets.MNIST('/data/mnist', train=True, download=True,
                             transform=torchvision.transforms.Compose([
                               torchvision.transforms.ToTensor(),
                               torchvision.transforms.Normalize(
                                 (0.1307,), (0.3081,))
                             ])),
  batch_size=16, shuffle=False)

If I print the dataset, I get this, but where are the labels?

Dataset MNIST
Number of datapoints: 60000
Split: train
Root Location: /data/mnist
Transforms (if any): Compose(
ToTensor()
Normalize(mean=(0.1307,), std=(0.3081,))
)
Target Transforms (if any): None

What do you suggest? The source is the MNIST handwritten digit database by Yann LeCun, Corinna Cortes, and Chris Burges.

You can print the labels using dataset.targets.

Happens to be that easy.
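
With the DataLoader from the question, the dataset is reachable as train_loader.dataset, so the labels would be train_loader.dataset.targets. Here is a minimal sketch of that access pattern. It assumes a tiny stand-in class, TinyDigits (hypothetical, not part of torchvision), that mimics the targets attribute so the snippet runs without downloading MNIST:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class TinyDigits(Dataset):
    """Hypothetical stand-in for MNIST that exposes a `targets` tensor."""

    def __init__(self):
        self.data = torch.zeros(8, 1, 28, 28)                   # fake images
        self.targets = torch.tensor([5, 0, 4, 1, 9, 2, 1, 3])   # fake labels

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        return self.data[idx], int(self.targets[idx])


loader = DataLoader(TinyDigits(), batch_size=4, shuffle=False)

# All labels at once, without iterating over the images.
all_labels = loader.dataset.targets
print(all_labels)

# The labels also arrive alongside the images in every batch.
images, labels = next(iter(loader))
print(images.shape, labels)
```

The second half is a reminder that a training loop usually does not need targets at all, since each batch already carries its labels.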

If I wanted to split this train=True dataset into training and validation parts (60000 = 50000 + 10000), would that be easy, or should I load another dataset with train=False (the test dataset) and use that as my validation set?

I would suggest wrapping your training Dataset in a Subset and passing the training and validation indices, while train=False would create the test dataset.

OK, so indexing alone decides between training and validation, and the test set stays the test set. I may not use it for my quick experiments. Good.

I am using the subsets now, but their datasets point to the original set:

test_ds = torch.utils.data.Subset(train_loader.dataset, (0, 50000-1))
valid_ds = torch.utils.data.Subset(train_loader.dataset, (50000, 60000-1))
print(test_ds.indices) #(0, 49999)
print(valid_ds.indices) #(50000, 59999)
print(test_ds.dataset.targets[0]) #tensor(5)
print(valid_ds.dataset.targets[0]) #tensor(5)

#SequentialSampler ???

Should I cast them to a set or use a sampler? I have no idea.

Subset is just a thin wrapper using the passed indices to index the underlying Dataset without actually manipulating it.
Have a look at this line of code.

You’ll get the right samples if you index the Subset directly:

x, y = test_ds[0]
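
To make the difference concrete, here is a small sketch, assuming a hypothetical stand-in dataset (TinyDigits) with a targets tensor like MNIST's. It shows that subset.dataset is still the full dataset, that indexing the Subset goes through its indices, and one possible way to get only a subset's labels:

```python
import torch
from torch.utils.data import Dataset, Subset


class TinyDigits(Dataset):
    """Hypothetical stand-in for MNIST: sample i has label i."""

    def __init__(self):
        self.data = torch.zeros(10, 1, 28, 28)  # fake images
        self.targets = torch.arange(10)         # fake labels 0..9

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        return self.data[idx], int(self.targets[idx])


full = TinyDigits()
train_ds = Subset(full, range(0, 7))    # samples 0..6
valid_ds = Subset(full, range(7, 10))   # samples 7..9

# .dataset is the untouched full dataset, so both lines print the same label.
print(train_ds.dataset.targets[0], valid_ds.dataset.targets[0])

# Indexing the Subset itself maps position 0 to original index 7.
_, y = valid_ds[0]
print(y)

# Labels of one subset only: index the full targets with the subset's indices.
valid_labels = full.targets[list(valid_ds.indices)]
print(valid_labels)
```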

Helpful:

import torch
import torchvision

train_loader = torch.utils.data.DataLoader(
  torchvision.datasets.MNIST('/data/mnist', train=True, download=True,
                             transform=torchvision.transforms.Compose([
                               torchvision.transforms.ToTensor(),
                               torchvision.transforms.Normalize(
                                (0.1307,), (0.3081,))
                             ])),
  batch_size=16, shuffle=False)
# range's end is exclusive: range(0, 50000) covers indices 0..49999.
test_ds  = torch.utils.data.Subset(train_loader.dataset, range(0, 50000))
valid_ds = torch.utils.data.Subset(train_loader.dataset, range(50000, 60000))
# Print the first ten labels of each subset side by side.
for i in range(10):
    ty = test_ds[i][1]
    vy = valid_ds[i][1]
    print(ty, vy)

out:

5 3
0 8
4 6
1 9
9 6
2 4
1 5
3 3
1 8
4 4

I have a subset of more than 30000 images and I am trying this code, but it is taking forever.

labels=[]
for _,data in train_set:
    labels.append(data['labels'])
print(labels)

Is there any faster way of doing this?

If each target is lazily created, you would have to iterate the Dataset or DataLoader at least once.
Depending on how train_set is defined, the targets might have been loaded completely in its __init__, so you could check that first. Also, a DataLoader should speed up the loop if you increase the batch size and use multiple workers.
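
A rough sketch of the DataLoader approach, assuming a hypothetical dataset (LazyLabels) whose target only exists once __getitem__ runs and is a plain integer:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class LazyLabels(Dataset):
    """Hypothetical dataset: the target is only produced inside __getitem__."""

    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        image = torch.zeros(1, 28, 28)  # placeholder for an expensive image load
        return image, idx % 10


# num_workers=0 keeps the sketch portable; on real data a higher value
# lets several worker processes load samples in parallel.
loader = DataLoader(LazyLabels(), batch_size=256, shuffle=False, num_workers=0)

# Each iteration yields one tensor of labels per batch; concatenate them.
chunks = [targets for _, targets in loader]
labels = torch.cat(chunks)
print(labels.shape)
```

If the targets are dicts with variable-length entries (such as bounding boxes), the default batching may not apply, and a custom collate_fn would probably be needed.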

How long does this loop take?

It was taking more than 30 minutes. Even if I run the loop only 100 times, it takes a long time. I think this is because there is a tensor at each index.

I’m not sure if you are using MNIST as in the original question, but this code takes ~1 second on my laptop:

from torchvision import datasets

dataset = datasets.MNIST(root=PATH)  # PATH: folder that already holds the MNIST files

labels = []
for _, label in dataset:
    labels.append(label)

I am loading different data from multiple folders with images and annotations (20000 images):

tensor_data, target = dataset[0]
# target looks like {'boxes': [....], 'labels': 'a'}