Why do we need Subsets at all?

blackbirdbarber · July 1, 2019, 12:48pm

I am using this code and I have a single question:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset, TensorDataset
from torch.optim import *
import torchvision


dl = DataLoader(
  torchvision.datasets.MNIST('/data/mnist', train=True, download=True,
                             transform=torchvision.transforms.Compose([
                               torchvision.transforms.ToTensor(),
                               torchvision.transforms.Normalize(
                                (0.5,), (0.5,))
                             ])),
  batch_size=16, shuffle=False)

# range() already excludes its end value, so no "-1" is needed
train_ds = torch.utils.data.Subset(dl.dataset, range(0, 50000))
valid_ds = torch.utils.data.Subset(dl.dataset, range(50000, 60000))

# this won't work 
# train_ds = TensorDataset(train_ds, valid_ds)
# AttributeError: 'Subset' object has no attribute 'size'

tensor = dl.dataset.data
tr = tensor.reshape(tensor.size(0), -1)  # reshaped
targets = dl.dataset.targets

# slices already exclude their end index, so no "-1" is needed
x_train = tr[0:50000]
y_train = targets[0:50000]
x_valid = tr[50000:60000]
y_valid = targets[50000:60000]

bs=64

x_train = x_train.to(dtype=torch.float32)
y_train = y_train.to(dtype=torch.long)

x_valid = x_valid.to(dtype=torch.float32)
y_valid = y_valid.to(dtype=torch.long)

train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=bs, drop_last=False, shuffle=True)

valid_ds = TensorDataset(x_valid, y_valid)
valid_dl = DataLoader(valid_ds, batch_size=bs * 2)

loaders={}
loaders['train'] = train_dl
loaders['valid'] = valid_dl


class M(nn.Module):
    'custom module'
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(784, 10)

    def forward(self, xb):
        return self.lin(xb)

model = M()

criterion = nn.CrossEntropyLoss()

bs=64
epochs = 2
lr = 0.0001
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
# optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)

for epoch in range(0,epochs):
    train_loss = 0
    valid_loss = 0
    print(f"Epoch {epoch}")

    model.train()
    for i, (data,target) in enumerate(loaders['train']):
                
        optimizer.zero_grad()
        output = model(data)                                    
        loss = criterion(output, target)        
        loss.backward()            
        optimizer.step()
        
        if i % 100 == 0:
            # .item() extracts the Python float from the 0-dim loss tensor
            print(f"Batch {i}, loss {loss.item():.4f}")

Question:

When I used torch.utils.data.Subset, it gave me the “'Subset' object has no attribute 'size'” error message, so I had no idea how to use subsets to feed DataLoaders through a TensorDataset. At my current level of understanding, Subset looks very impractical and even creates confusion. Any comments on why we need them?

ptrblck · July 1, 2019, 1:50pm

Which line of code threw this error?
Could you explain your confusion a bit?
Subset is a convenient way of using certain indices to specify subsets of your Dataset.
While in your example you can easily split the loaded data manually, this won’t be possible in a lot of cases, where you would have to lazily load the data.
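
To illustrate the point about indices, here is a minimal sketch using a small synthetic TensorDataset as a stand-in for a real dataset (the tensor sizes and variable names are made up for illustration). It shows that a Subset accepts any sequence of indices, not just a contiguous range, and that it only forwards lookups to the parent dataset instead of copying data:

```python
import torch
from torch.utils.data import Subset, TensorDataset

# Synthetic stand-in for a real dataset: 10 samples with 3 features each.
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
labels = torch.arange(10)
dataset = TensorDataset(features, labels)

# Any sequence of indices works, e.g. a random permutation for a shuffled split.
torch.manual_seed(0)
perm = torch.randperm(len(dataset)).tolist()
train_subset = Subset(dataset, perm[:8])
valid_subset = Subset(dataset, perm[8:])

# Item 0 of the subset is simply item perm[0] of the parent dataset.
x, y = train_subset[0]
parent_x, parent_y = dataset[perm[0]]
print(len(train_subset), len(valid_subset))
print(torch.equal(x, parent_x), y.item() == parent_y.item())
```

Because the Subset holds only indices, the same approach works unchanged when the parent dataset reads its samples from disk on demand.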

blackbirdbarber · July 1, 2019, 2:03pm

ptrblck: “Which line of code threw this error?”

To me, Subsets do not work well with TensorDataset, since Subsets are not tensors. This line will fail:

train_ds = TensorDataset(train_ds, valid_ds)

ptrblck: “Could you explain your confusion a bit?”

I was not sure why we need Subsets, since for the MNIST dataset (60000 images) I cannot use them to make a TensorDataset.

I am not sure if I can use something other than TensorDataset, but the idea was to move the tensors to CUDA.

I guess it is faster to move the data and targets to CUDA together, but I am not sure; maybe putting just the data on CUDA is OK, and keeping the targets on the CPU may still be fast.

So, since the data and targets go together, I see TensorDataset as a good choice for any supervised architecture that has both data and targets.

ptrblck: “Subset is a convenient way of using certain indices to specify subsets of your Dataset. While in your example you can easily split the loaded data manually, this won’t be possible in a lot of cases, where you would have to lazily load the data.”

That may be it; I haven’t experimented with lazily loading data.

ptrblck · July 1, 2019, 2:16pm

You don’t need to wrap the subsets again in TensorDatasets.
In fact, you would create your Dataset first (e.g. as a TensorDataset or any other) and wrap it into a Subset to get the samples for the corresponding indices. This subset can then be wrapped by a DataLoader.
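
The order described above (Dataset, then Subset, then DataLoader) could look roughly like the following sketch. Random tensors stand in for flattened MNIST so it runs without a download; the sizes and the 80/20 split are assumptions chosen for illustration only:

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Synthetic stand-in for flattened MNIST: 100 samples with 784 features.
x = torch.rand(100, 784)
y = torch.randint(0, 10, (100,))

# 1. Create the Dataset first.
dataset = TensorDataset(x, y)

# 2. Wrap it in Subsets to select the samples for each split.
train_ds = Subset(dataset, range(0, 80))
valid_ds = Subset(dataset, range(80, 100))

# 3. Wrap each Subset in its own DataLoader.
train_dl = DataLoader(train_ds, batch_size=16, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=32)

xb, yb = next(iter(train_dl))
print(xb.shape, yb.shape)
```

Note that the Subsets are never passed to TensorDataset; they are datasets in their own right and go straight into the DataLoader.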

Note that TensorDataset is used for tensors already loaded into memory.
If you are dealing with a large dataset, this might not be possible, so you could use another dataset, e.g. torchvision.datasets.ImageFolder or a custom Dataset implementation for lazy loading.
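
To make "lazy loading" concrete, here is a small sketch with a hypothetical `CountingLazyDataset` that records which samples were actually requested. A real implementation would read a file from disk in `__getitem__`; here a tensor is generated instead so the example stays self-contained:

```python
import torch
from torch.utils.data import DataLoader, Dataset, Subset

class CountingLazyDataset(Dataset):
    """Hypothetical dataset that creates each sample only when it is requested."""

    def __init__(self, num_samples):
        self.num_samples = num_samples
        self.loaded = []  # records which indices were actually loaded

    def __getitem__(self, index):
        # A real dataset would read and decode a file from disk here.
        self.loaded.append(index)
        x = torch.full((4,), float(index))
        y = index % 2
        return x, y

    def __len__(self):
        return self.num_samples

dataset = CountingLazyDataset(1000)
subset = Subset(dataset, range(900, 1000))
print(len(dataset.loaded))  # nothing has been loaded so far

# num_workers=0 keeps loading in the main process so the counter is visible.
loader = DataLoader(subset, batch_size=8, num_workers=0)
xb, yb = next(iter(loader))
print(xb.shape, dataset.loaded)
```

Creating the dataset and the Subset touches no samples at all; fetching one batch loads only the eight samples that batch needs. The DataLoader calls `__getitem__` once per sample and collates the results into a batch.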

The usual way would be to load the samples into the RAM and push each batch onto the GPU in the training loop.
If your model is small and you are not working with a lot of data samples, you could of course push the whole dataset onto the GPU beforehand, but note that you might be wasting precious GPU memory.
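
A minimal sketch of the usual pattern, assuming a synthetic dataset and a single linear layer in place of a real model: the dataset stays in CPU memory and each batch is moved to the device inside the loop. Both the data and the targets are moved, since the loss needs them on the same device as the model output. The code falls back to the CPU when no GPU is available:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Synthetic stand-in for flattened MNIST, kept in CPU memory.
x = torch.rand(256, 784)
y = torch.randint(0, 10, (256,))
loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)

model = nn.Linear(784, 10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model.train()
for data, target in loader:
    # Only the current batch is transferred; the full dataset stays on the CPU.
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()
    optimizer.step()

print(x.device, loss.item())
```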

blackbirdbarber · July 1, 2019, 4:37pm

You suggest I pass a Subset to a DataLoader, but I only used the DataLoader to provide data for the Subset.

I have a few problems with that. I have never seen lazy loading, and I am not sure what exactly it is. Searching the PyTorch website gave me 0 results.

dl = DataLoader(
  torchvision.datasets.MNIST('/data/mnist', train=True, download=True,
                             transform=torchvision.transforms.Compose([
                               torchvision.transforms.ToTensor(),                               
                               torchvision.transforms.Normalize(
                                (0.5,), (0.5,))
                             ])), shuffle=False)

# range() already excludes its end value, so no "-1" is needed
train_ds = torch.utils.data.Subset(dl.dataset, range(0, 50000))
valid_ds = torch.utils.data.Subset(dl.dataset, range(50000, 60000))


tensor = dl.dataset.data
tensor = tensor.to(dtype=torch.float32)
tr = tensor.reshape(tensor.size(0), -1) 
tr = tr/128 # tr = tr/255
targets = dl.dataset.targets
targets = targets.to(dtype=torch.long)

It would be cool if I could use the DataLoader (first line of code) to also scale my tensors (divide by 128 or 256) for faster learning and set the float dtype, so that once I have the Subsets I am ready to use DataLoader again, this time I guess with two DataLoaders, one for training and the other for validation.

blackbirdbarber · July 1, 2019, 11:17pm

I found that lazy loading is a term similar to loading on demand, or, to be more precise, loading a single item (image) on demand.

Loading here means moving it from non-GPU memory to the GPU, which is what I was most interested in.

This further means writing a custom Dataset and overriding __getitem__, as described here.

If we are writing a custom Dataset, why do we need a Subset, which is also a custom Dataset?

If we load a single item, and we do that for every item in the batch, wouldn’t that be slower compared to loading a whole batch at once (DataLoader)?

Which leads to the question: when is the best time to load data to the GPU, in the Dataset or in the DataLoader?

blackbirdbarber · July 5, 2019, 8:10am

@ptrblck, can you share an example from the Internet where the lazy loading technique is used, and also an example where Subsets are used to load MNIST?

I recall searching for hours, but somehow I came up short.

ptrblck · July 5, 2019, 10:54am

You should find plenty of custom Datasets in this discussion board.
Here is an example:

from PIL import Image
from torch.utils.data import Dataset

class MyLazyDataset(Dataset):
    def __init__(self, image_paths, targets, transform=None):
        self.image_paths = image_paths
        self.targets = targets
        self.transform = transform
        
    def __getitem__(self, index):
        # Load actual image here
        x = Image.open(self.image_paths[index])
        if self.transform:
            x = self.transform(x)
        y = self.targets[index]
        
        return x, y
    
    def __len__(self):
        return len(self.image_paths)

As you can see, you are just passing the image paths without loading all images beforehand.
In the __getitem__ method, the actual sample is loaded and transformed.
In my example I would have to compute the targets beforehand and pass them to MyLazyDataset.
However, you could of course create the target in __getitem__, e.g. by using the current image path and checking it for a class name.
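
As a sketch of that last idea, the hypothetical `PathLabelDataset` below derives the target from the parent folder name of each path. The directory layout `<root>/<class_name>/<file>`, the class names, and the `loader` argument are all assumptions made for illustration; with real images the loader could be `PIL.Image.open`:

```python
from pathlib import Path

import torch
from torch.utils.data import Dataset

class PathLabelDataset(Dataset):
    """Hypothetical dataset that infers the target from the parent folder name."""

    def __init__(self, image_paths, class_names, loader, transform=None):
        self.image_paths = [Path(p) for p in image_paths]
        self.class_to_idx = {name: i for i, name in enumerate(class_names)}
        self.loader = loader  # callable that turns a path into a sample
        self.transform = transform

    def __getitem__(self, index):
        path = self.image_paths[index]
        x = self.loader(path)
        if self.transform:
            x = self.transform(x)
        # Assumed layout: <root>/<class_name>/<file>
        y = self.class_to_idx[path.parent.name]
        return x, y

    def __len__(self):
        return len(self.image_paths)

# A fake loader lets the sketch run without any image files on disk.
paths = ["data/cat/001.png", "data/dog/002.png", "data/cat/003.png"]
fake_loader = lambda path: torch.zeros(1, 28, 28)
ds = PathLabelDataset(paths, class_names=["cat", "dog"], loader=fake_loader)
print([ds[i][1] for i in range(len(ds))])
```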

If you are using ImageFolder, the images will be lazily loaded by default.

Here is a small example of the usage of Subset:

from torch.utils.data import Subset
from torchvision import datasets, transforms

dataset = datasets.MNIST(
    root='./data',
    download=False,
    transform=transforms.ToTensor()
)


train_dataset = Subset(dataset, indices=range(50000))
val_dataset = Subset(dataset, indices=range(50000, 60000))

You just have to pass your Dataset and the indices you would like to use in the subset.
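
Regarding the earlier wish to scale the data and set the float dtype: torchvision's MNIST applies its `transform` in `__getitem__`, so `dataset.data` stays raw uint8 while samples fetched through a Subset arrive already transformed. The sketch below mimics this with a hypothetical `RawImageDataset` and random data so it runs without a download; `to_scaled_float` is an illustrative stand-in for `ToTensor` plus a reshape:

```python
import torch
from torch.utils.data import DataLoader, Dataset, Subset

class RawImageDataset(Dataset):
    """Hypothetical stand-in for MNIST: raw uint8 images plus a per-item transform."""

    def __init__(self, data, targets, transform=None):
        self.data = data
        self.targets = targets
        self.transform = transform

    def __getitem__(self, index):
        x, y = self.data[index], int(self.targets[index])
        if self.transform:
            x = self.transform(x)
        return x, y

    def __len__(self):
        return len(self.data)

def to_scaled_float(img):
    # Convert to float32, scale to [0, 1], and flatten to a vector.
    return img.to(torch.float32).div(255).reshape(-1)

data = torch.randint(0, 256, (600, 28, 28), dtype=torch.uint8)
targets = torch.randint(0, 10, (600,))
dataset = RawImageDataset(data, targets, transform=to_scaled_float)

# One DataLoader per Subset: one for training, one for validation.
train_dl = DataLoader(Subset(dataset, range(500)), batch_size=64, shuffle=True)
valid_dl = DataLoader(Subset(dataset, range(500, 600)), batch_size=128)

xb, yb = next(iter(train_dl))
print(dataset.data.dtype, xb.dtype, xb.shape, yb.dtype)
```

With this arrangement there is no need to reach into `dataset.data` and convert it by hand; the transform handles scaling and dtype for every sample, in both splits.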