I am using this code and I have a single question:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset, TensorDataset
from torch.optim import *
import torchvision
dl = DataLoader(
    torchvision.datasets.MNIST('/data/mnist', train=True, download=True,
                               transform=torchvision.transforms.Compose([
                                   torchvision.transforms.ToTensor(),
                                   torchvision.transforms.Normalize(
                                       (0.5,), (0.5,))
                               ])),
    batch_size=16, shuffle=False)
# the end of a range is exclusive, so range(0, 50000) covers indices 0..49999
train_ds = torch.utils.data.Subset(dl.dataset, range(0, 50000))
valid_ds = torch.utils.data.Subset(dl.dataset, range(50000, 60000))
# this won't work
# train_ds = TensorDataset(train_ds, valid_ds)
# AttributeError: 'Subset' object has no attribute 'size'
tensor = dl.dataset.data
tr = tensor.reshape(tensor.size(0), -1) # reshaped
targets = dl.dataset.targets
x_train = tr[0:50000]  # slice ends are exclusive, so this keeps all 50000 samples
y_train = targets[0:50000]
x_valid = tr[50000:60000]
y_valid = targets[50000:60000]
bs=64
x_train = x_train.to(dtype=torch.float32)
y_train = y_train.to(dtype=torch.long)
x_valid = x_valid.to(dtype=torch.float32)
y_valid = y_valid.to(dtype=torch.long)
train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=bs, drop_last=False, shuffle=True)
valid_ds = TensorDataset(x_valid, y_valid)
valid_dl = DataLoader(valid_ds, batch_size=bs * 2)
loaders={}
loaders['train'] = train_dl
loaders['valid'] = valid_dl
class M(nn.Module):
    'custom module'
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(784, 10)

    def forward(self, xb):
        return self.lin(xb)

model = M()
criterion = nn.CrossEntropyLoss()
bs=64
epochs = 2
lr = 0.0001
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
# optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
for epoch in range(0, epochs):
    train_loss = 0
    valid_loss = 0
    print(f"Epoch {epoch}")
    model.train()
    for i, (data, target) in enumerate(loaders['train']):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        if i % 100 == 0:
            # .item() extracts the Python float from the 0-dim loss tensor
            print(f"Batch {i}, loss {loss.item()}")
Question:
When I used torch.utils.data.Subset, it gave me the “Subset object has no attribute size” error message, so I had no idea how to use subsets to feed DataLoaders with TensorDataset. At my current level of understanding, Subset looks very impractical and even creates confusion. Any comments on why we need it?
Which line of code threw this error?
Could you explain your confusion a bit?
Subset is a convenient way of using certain indices to specify subsets of your Dataset.
While in your example you can easily split the loaded data manually, this won’t be possible in a lot of cases, where you would have to lazily load the data.
Which line of code threw this error?
To me, Subsets do not work well with TensorDataset, since Subsets are not tensors. This line will fail:
train_ds = TensorDataset(train_ds, valid_ds)
Could you explain your confusion a bit?
I was not sure why we need Subsets, since for the MNIST dataset (60000 images) I cannot use them to make a TensorDataset.
I am not sure if I can use something other than TensorDataset, but the idea was to move the tensors to CUDA.
I guess it is faster to move data and targets to CUDA together, but I am not sure; maybe moving just the data to CUDA is OK, and keeping the targets on the CPU may still be fast.
So, because it keeps data and targets together, I see TensorDataset as a good choice for any supervised architecture that has data and targets.
Subset is a convenient way of using certain indices to specify subsets of your Dataset. While in your example you can easily split the loaded data manually, this won’t be possible in a lot of cases, where you would have to lazily load the data
That may be it; I haven’t experimented with lazily loading data.
You don’t need to wrap the subsets again in TensorDatasets.
In fact, you would create your Dataset first (e.g. as a TensorDataset or any other) and wrap it into a Subset to get the samples for the corresponding indices. This subset can then be wrapped by a DataLoader.
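The order described above (Dataset first, then Subset, then DataLoader) could look roughly like the following minimal sketch. It uses random tensors as a stand-in for flattened MNIST, and the sizes (100 samples, an 80/20 split) are arbitrary assumptions chosen only to keep the example small.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Synthetic stand-in for flattened MNIST: 100 samples, 784 features, 10 classes.
x = torch.randn(100, 784)
y = torch.randint(0, 10, (100,))

# 1. Create the full Dataset first.
full_ds = TensorDataset(x, y)

# 2. Wrap it in Subsets that select samples by index.
train_ds = Subset(full_ds, range(0, 80))
valid_ds = Subset(full_ds, range(80, 100))

# 3. Wrap each Subset in its own DataLoader.
train_dl = DataLoader(train_ds, batch_size=16, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=32)

xb, yb = next(iter(train_dl))
print(xb.shape, yb.shape)  # torch.Size([16, 784]) torch.Size([16])
```

Note that the Subsets are never passed to TensorDataset; they are Datasets themselves and go straight into the DataLoader.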
Note that TensorDataset is used for tensors already loaded into memory.
If you are dealing with a large dataset, this might not be possible, so you could use another dataset, e.g. torchvision.datasets.ImageFolder or a custom Dataset implementation for lazy loading.
The usual way would be to load the samples into the RAM and push each batch onto the GPU in the training loop.
If your model is small and you are not working with a lot of data samples, you could of course push the whole dataset onto the GPU beforehand, but note that you might be wasting precious GPU memory.
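As a rough sketch of the usual approach, the data can stay in CPU memory while each batch is moved to the device inside the training loop. The tensors below are random placeholders, and the model, learning rate, and batch size are illustrative assumptions rather than recommendations.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Fall back to the CPU when no GPU is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Synthetic data kept in CPU memory.
x = torch.randn(256, 784)
y = torch.randint(0, 10, (256,))
loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)

model = nn.Linear(784, 10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model.train()
for data, target in loader:
    # Move only the current batch (data and targets) to the device.
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()
    optimizer.step()
```

Both the data and the targets have to be on the same device as the model output when the loss is computed, so they are moved together.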
You suggest I put a Subset into a DataLoader, but I just used a DataLoader to provide data for the Subset.
I have a few problems with that. I have never seen lazy loading, and I am not sure what exactly it is. Searching the PyTorch website gave me 0 results.
dl = DataLoader(
    torchvision.datasets.MNIST('/data/mnist', train=True, download=True,
                               transform=torchvision.transforms.Compose([
                                   torchvision.transforms.ToTensor(),
                                   torchvision.transforms.Normalize(
                                       (0.5,), (0.5,))
                               ])), shuffle=False)
# the end of a range is exclusive, so range(0, 50000) covers indices 0..49999
train_ds = torch.utils.data.Subset(dl.dataset, range(0, 50000))
valid_ds = torch.utils.data.Subset(dl.dataset, range(50000, 60000))
tensor = dl.dataset.data
tensor = tensor.to(dtype=torch.float32)
tr = tensor.reshape(tensor.size(0), -1)
tr = tr/128 # tr = tr/255
targets = dl.dataset.targets
targets = targets.to(dtype=torch.long)
It would be cool if I could use the DataLoader (first line of code) to also scale my tensors by 128 or 256 (divide by 128 or 256) for faster learning and set the float dtype, so once I have the Subsets I am all set to use DataLoader again, but this time I guess with two DataLoaders, one for training and the other for validation.
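One possible way to get scaling and dtype conversion applied as samples are read is to do it in the Dataset rather than on the raw tensors. The wrapper below is a hypothetical sketch (the class name `ScaledFlatDataset` and the `scale` parameter are made up for illustration) and uses random uint8 images in place of MNIST. As far as torchvision goes, `ToTensor` already converts uint8 images to float32 in [0, 1], but that transform only runs when samples are read through the dataset, not when `dataset.data` is accessed directly.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ScaledFlatDataset(Dataset):
    """Hypothetical wrapper: flattens uint8 images and scales them on access."""

    def __init__(self, images, targets, scale=255.0):
        self.images = images
        self.targets = targets
        self.scale = scale

    def __getitem__(self, index):
        # Flatten to 784 values, convert to float32, and scale into [0, 1].
        x = self.images[index].reshape(-1).to(torch.float32) / self.scale
        y = self.targets[index].to(torch.long)
        return x, y

    def __len__(self):
        return len(self.images)


# Random stand-in for the raw MNIST tensors (uint8, 28x28).
images = torch.randint(0, 256, (100, 28, 28), dtype=torch.uint8)
targets = torch.randint(0, 10, (100,))

ds = ScaledFlatDataset(images, targets)
xb, yb = next(iter(DataLoader(ds, batch_size=16)))
print(xb.shape, xb.dtype)  # torch.Size([16, 784]) torch.float32
```

A dataset like this can be wrapped in two Subsets and two DataLoaders, one for training and one for validation, without touching the underlying tensors.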
I found that lazy loading is a similar term to load on demand or, to be more precise, loading a single item (image) on demand.
Loading here means moving it from non-GPU memory to the GPU, which is what I was most interested in.
This further means writing a custom Dataset and overriding __getitem__, as described here.
If we are writing a custom Dataset, why do we need a Subset, which is also a custom Dataset?
If we load a single item, and do that for every item in the batch, wouldn’t that be slower compared to loading a whole batch at once (DataLoader)?
Which leads to the question: when is the best time to load to the GPU, in the Dataset or in the DataLoader?
@ptrblck, can you share any example from the Internet where the lazy loading technique is used, and also an example where Subsets are used to load MNIST?
I recall I searched for hours but somehow came up short.
You should find plenty of custom Datasets in this discussion board.
Here is an example:
from PIL import Image
from torch.utils.data import Dataset

class MyLazyDataset(Dataset):
    def __init__(self, image_paths, targets, transform=None):
        self.image_paths = image_paths
        self.targets = targets
        self.transform = transform

    def __getitem__(self, index):
        # Load actual image here
        x = Image.open(self.image_paths[index])
        if self.transform:
            x = self.transform(x)
        y = self.targets[index]
        return x, y

    def __len__(self):
        return len(self.image_paths)
As you can see, you are just passing the image paths without loading all images beforehand.
In the __getitem__ method, the actual sample is loaded and transformed.
In my example I would have to compute the targets before and pass them to MyLazyDataset.
However, you could of course create the target in __getitem__, e.g. by using the current image path and checking for a class name.
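Deriving the target from the path could look roughly like this sketch. It assumes a directory layout in which the parent folder of each file is its class name (e.g. `data/cat/1.png`); the class `PathLabelDataset`, the `loader` argument, and the example paths are hypothetical. A stand-in loader replaces `Image.open` so the example runs without image files on disk.

```python
from pathlib import Path

import torch
from torch.utils.data import Dataset


class PathLabelDataset(Dataset):
    """Hypothetical dataset: the class name is the parent folder of each file."""

    def __init__(self, image_paths, class_names, loader, transform=None):
        self.image_paths = [Path(p) for p in image_paths]
        # Map each class name to an integer index, in sorted order.
        self.class_to_idx = {name: i for i, name in enumerate(sorted(class_names))}
        self.loader = loader
        self.transform = transform

    def __getitem__(self, index):
        path = self.image_paths[index]
        x = self.loader(path)  # the sample is only loaded here
        if self.transform:
            x = self.transform(x)
        y = self.class_to_idx[path.parent.name]  # target derived from the path
        return x, y

    def __len__(self):
        return len(self.image_paths)


paths = ["data/cat/1.png", "data/dog/7.png", "data/cat/3.png"]


def fake_loader(path):
    # Stand-in for PIL.Image.open; returns a dummy image tensor.
    return torch.zeros(3, 8, 8)


ds = PathLabelDataset(paths, ["cat", "dog"], loader=fake_loader)
labels = [ds[i][1] for i in range(len(ds))]
print(labels)  # [0, 1, 0]
```

With real files, `loader=Image.open` (from PIL) would take the place of `fake_loader`.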
If you are using ImageFolder, the images will be lazily loaded by default.
Here is a small example of the usage of Subset:
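from torch.utils.data import Subset
from torchvision import datasets, transforms

dataset = datasets.MNIST(
    root='./data',
    download=False,
    transform=transforms.ToTensor()
)
# first 50000 samples for training, the remaining 10000 for validation
train_dataset = Subset(dataset, indices=range(50000))
val_dataset = Subset(dataset, indices=range(50000, 60000))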
You just have to pass your Dataset and the indices you would like to use in the subset.
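The indices do not have to be contiguous ranges. As a small sketch, a shuffled split could be built by passing slices of a random permutation to Subset; the tiny 10-sample TensorDataset and the seed below are arbitrary assumptions used only for illustration.

```python
import torch
from torch.utils.data import Subset, TensorDataset

# Tiny stand-in dataset: sample i has feature [i] and target i.
dataset = TensorDataset(torch.arange(10).unsqueeze(1).float(), torch.arange(10))

# Shuffle the indices reproducibly, then split them 8/2.
generator = torch.Generator().manual_seed(0)
perm = torch.randperm(len(dataset), generator=generator).tolist()

train_dataset = Subset(dataset, indices=perm[:8])
val_dataset = Subset(dataset, indices=perm[8:])

# Index i of a Subset maps to index perm[i] of the wrapped dataset.
x0, y0 = train_dataset[0]
print(len(train_dataset), len(val_dataset))  # 8 2
```

Because both Subsets share the same underlying dataset, no sample data is copied; only the index lists differ.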