Strange, but I would like to load the MNIST labels using torchvision.datasets.MNIST. I loaded the images like this:
train_loader = torch.utils.data.DataLoader(
    torchvision.datasets.MNIST('/data/mnist', train=True, download=True,
        transform=torchvision.transforms.Compose([
            torchvision.transforms.ToTensor(),
            torchvision.transforms.Normalize(
                (0.1307,), (0.3081,))
        ])),
    batch_size=16, shuffle=False)
If I print the dataset I get this, but where are the labels?
Dataset MNIST
    Number of datapoints: 60000
    Split: train
    Root Location: /data/mnist
    Transforms (if any): Compose(
        ToTensor()
        Normalize(mean=(0.1307,), std=(0.3081,))
    )
    Target Transforms (if any): None
What do you suggest? The data source is the MNIST handwritten digit database (Yann LeCun, Corinna Cortes and Chris Burges).
You can print the labels using dataset.targets.
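A minimal sketch of reading labels through the targets attribute. To stay runnable without downloading anything, it uses a hypothetical stand-in class (TinyLabelled) that only mimics the targets attribute of torchvision's MNIST; the helper name label_histogram is also made up for illustration. With the real dataset you would pass train_loader.dataset instead. As far as I know, targets holds the raw labels, so a target_transform would not be reflected there.

```python
import torch


def label_histogram(dataset, num_classes=10):
    """Count how often each class occurs by reading dataset.targets directly."""
    # torch.as_tensor also accepts a plain list of ints, not only a tensor.
    targets = torch.as_tensor(dataset.targets)
    return torch.bincount(targets, minlength=num_classes)


class TinyLabelled:
    """Hypothetical stand-in that exposes .targets the way MNIST does."""

    def __init__(self, targets):
        self.targets = torch.tensor(targets)


# The ten labels below are the first ten training labels shown later in this thread.
hist = label_histogram(TinyLabelled([5, 0, 4, 1, 9, 2, 1, 3, 1, 4]))
print(hist.tolist())
```

No image is decoded here, which is why this is much cheaper than looping over the dataset.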
ptrblck:
dataset.targets
Happens to be that easy.
If I wanted to split this train=True dataset into training and validation parts (60000 = 50000 + 10000), would that be easy, or should I load the train=False dataset (the test dataset) and use that as my validation set?
I would suggest wrapping your training Dataset into Subset and passing the training and validation indices, while train=False would create the test dataset.
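A minimal sketch of that split, assuming a small synthetic TensorDataset (60 samples) in place of the 60000 MNIST training images so it runs offline. The sizes and variable names are illustrative only; for MNIST you would wrap the train=True dataset and use 50000 as the boundary.

```python
import torch
from torch.utils.data import Subset, TensorDataset

# Synthetic stand-in: 60 fake "images" with labels 0..9 repeating.
full = TensorDataset(torch.randn(60, 1, 28, 28), torch.arange(60) % 10)

n_train = 50
# range() excludes its stop value, so these two ranges cover every index once.
train_ds = Subset(full, range(0, n_train))
valid_ds = Subset(full, range(n_train, len(full)))

print(len(train_ds), len(valid_ds))
```

Both subsets share the same underlying data; only the index lists differ.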
ptrblck:
Dataset into Subset
OK, so the indices alone decide what is train and what is validation, and the test set stays the test set. I may not use that for my blitz experiments. Good.
I am using the subsets now, but their datasets point to the original set:
test_ds = torch.utils.data.Subset(train_loader.dataset, (0, 50000-1))
valid_ds = torch.utils.data.Subset(train_loader.dataset, (50000, 60000-1))
print(test_ds.indices) #(0, 49999)
print(valid_ds.indices) #(50000, 59999)
print(test_ds.dataset.targets[0]) #tensor(5)
print(valid_ds.dataset.targets[0]) #tensor(5)
#SequentialSampler ???
Should I cast them to Set or use a sampler… no idea.
Subset is just a thin wrapper using the passed indices to index the underlying Dataset without actually manipulating it.
Have a look at this line of code.
You’ll get the right samples if you index the Subset directly:
x, y = test_ds[0]
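To illustrate the thin-wrapper behaviour, here is a small sketch on a synthetic TensorDataset (an assumption made so it runs without MNIST). It shows that a tuple such as (5, 9) is read as two individual indices rather than a start/end pair, and that subset.dataset is still the full underlying dataset, so subset.dataset.targets[0] is always the first label of the original set.

```python
import torch
from torch.utils.data import Subset, TensorDataset

# Ten samples whose label equals their position, so indices are easy to follow.
base = TensorDataset(torch.zeros(10, 1), torch.arange(10))

pair = Subset(base, (5, 9))        # exactly two samples: base[5] and base[9]
span = Subset(base, range(5, 10))  # five samples: base[5] .. base[9]

print(len(pair), len(span))
print(span.dataset is base)        # the wrapper keeps a reference to the full set

# Indexing the Subset goes through its indices.
_, first_label = span[0]
print(first_label.item())

# Labels of the subset only, taken from the underlying label tensor.
subset_labels = base.tensors[1][list(span.indices)]
print(subset_labels.tolist())
```

For a dataset that exposes targets as a tensor, the last step would presumably look like dataset.targets[list(subset.indices)].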
Helpful:
import torch
import torchvision

train_loader = torch.utils.data.DataLoader(
    torchvision.datasets.MNIST('/data/mnist', train=True, download=True,
        transform=torchvision.transforms.Compose([
            torchvision.transforms.ToTensor(),
            torchvision.transforms.Normalize(
                (0.1307,), (0.3081,))
        ])),
    batch_size=16, shuffle=False)

# range() excludes its stop value, so these cover indices 0..49999 and 50000..59999.
# Note that test_ds holds the training part here, despite its name.
test_ds = torch.utils.data.Subset(train_loader.dataset, range(0, 50000))
valid_ds = torch.utils.data.Subset(train_loader.dataset, range(50000, 60000))

# Index the Subsets directly; each item is an (image, label) pair.
for i in range(0, 10):
    ty = test_ds[i][1]
    vy = valid_ds[i][1]
    print(ty, vy)
out:
5 3
0 8
4 6
1 9
9 6
2 4
1 5
3 3
1 8
4 4
monster (Monster), June 4, 2020, 5:04pm:
I have a subset of more than 30000 images and I am trying this code, but it is taking forever.
labels = []
for _, data in train_set:
    labels.append(data['labels'])
print(labels)
Is there any faster way of doing this?
If each target is lazily created, you would have to iterate the Dataset or DataLoader at least once.
Depending on how train_set is defined, the targets might have been loaded completely in its __init__, so you could check that. Also, a DataLoader should speed up the loop if you increase the batch size and use multiple workers.
How long does this loop take?
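A rough sketch of the DataLoader suggestion, assuming a dataset that returns (image, target_dict) pairs like the one described in the question. ToyDetection, keep_labels and collect_labels are hypothetical names invented for this example, and the toy dataset creates its samples in memory instead of reading files. A custom collate_fn is used because the number of boxes varies per image, which the default collate function cannot stack.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDetection(Dataset):
    """Hypothetical stand-in returning (image, {'boxes': ..., 'labels': ...})."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        image = torch.zeros(3, 8, 8)
        # A varying number of boxes per sample, as in detection datasets.
        target = {"boxes": torch.zeros(i % 3 + 1, 4), "labels": i % 5}
        return image, target


def keep_labels(batch):
    # Drop images and boxes; return only the label of each sample in the batch.
    return [target["labels"] for _, target in batch]


def collect_labels(dataset, batch_size=64, num_workers=0):
    # num_workers > 0 loads samples in parallel worker processes; on platforms
    # that spawn processes, call this from under `if __name__ == "__main__":`.
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False,
                        num_workers=num_workers, collate_fn=keep_labels)
    labels = []
    for batch_labels in loader:
        labels.extend(batch_labels)
    return labels


labels = collect_labels(ToyDetection(200))
print(len(labels), labels[:6])
```

Each sample still goes through __getitem__, so if image decoding is the bottleneck the gain comes mainly from the parallel workers. If the annotations are available separately from the images, reading them directly would avoid that cost altogether.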
monster (Monster), June 5, 2020, 11:02am:
It was taking more than 30 minutes. Even if I run the loop only 100 times, it still takes a lot of time. I think this is because of the tensor at each index.
I’m not sure if you are using MNIST as in the original question, but this code takes ~1 second on my laptop:
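from torchvision import datasets

# PATH is the dataset root directory (not defined in the original post).
dataset = datasets.MNIST(root=PATH)
labels = []
for _, label in dataset:
    labels.append(label)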
monster (Monster), June 6, 2020, 4:12pm:
I am loading different data from multiple folders with images and annotations (20000 images). Each item looks like this:
(tensor_data, {'boxes': [....], 'labels': 'a'}) = dataset[0]