Should we add a transform ToVariable?

We already have a ToTensor class that transforms a numpy-style image into a torch tensor. It seems that a ToVariable class could also be added, so that the conversion runs in the data loader's worker processes and speeds up data loading. Does this idea make sense? Thanks.

from torch.autograd import Variable


class ToVariable(object):
    """Convert Tensors in sample to Variable."""

    def __call__(self, sample):
        return Variable(sample)

Converting a tensor to a Variable doesn’t incur any noticeable time penalty, so I don’t see why it would make things faster.
I think the best approach is to convert the tensors right after they are returned by the dataloader, so that there is only a single batched tensor to convert to a Variable.
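
A minimal sketch of that suggestion, assuming a toy TensorDataset (the data and the names `loader` and `shapes` are made up for illustration): the loader yields plain tensors, and each collated batch is wrapped once in the main loop. On PyTorch 0.4 and later, where Variable and Tensor are merged, the wrap simply returns a tensor.

```python
import torch
import torch.utils.data as data
from torch.autograd import Variable

# Hypothetical toy data: 5 samples with 2 features each, one label per sample.
X = torch.arange(0, 10).float().view(-1, 2)
Y = torch.zeros(5, 1)

loader = data.DataLoader(data.TensorDataset(X, Y), batch_size=2)

shapes = []
for inputs, labels in loader:
    # Wrap the already-collated batch once, in the main process.
    inputs, labels = Variable(inputs), Variable(labels)
    shapes.append((tuple(inputs.size()), tuple(labels.size())))
```

With this layout the Dataset and its transforms only ever deal with plain tensors, which is what the default collate function expects.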


Thanks. It doesn’t work either, since the data loader doesn’t recognize a Variable:

TypeError: batch must contain tensors, numbers, dicts or lists; found <class 'torch.autograd.variable.Variable'>

I think you wouldn’t want this because the transforms run before the collate function in DataLoader. Instead, you might want to create a custom collate function. Below is a quick example:

import torch
import torch.utils.data as data
from torch.autograd import Variable

def variable_collate(batch):
    """Stacks a batch of (input, label) pairs and wraps each stack in a Variable.
       Args:
         batch: (list) of (input, label) tuples.  In this simple example, I'm just assuming the inputs and labels are already Tensor types
       Output:
         minibatch: (Variable)
         targets: (Variable)
    """
    # Transpose the list of pairs into a tuple of inputs and a tuple of labels.
    minibatch, targets = zip(*batch)
    minibatch, targets = torch.stack(minibatch, dim=0), torch.stack(targets, dim=0)
    minibatch, targets = Variable(minibatch), Variable(targets)
    return minibatch, targets
    
X = torch.arange(0, 10).view(-1, 2)
Y = torch.zeros(5).view(-1, 1)

ds = data.TensorDataset(X, Y)
dl = data.DataLoader(ds, batch_size=1, collate_fn=variable_collate)

for mb, tgts in dl:
    print(mb, tgts)

Why do you want to return Variables in the Dataset? I would actually avoid that pattern. But if you really want to, you can provide your own collate_fn, as pointed out by @dhpollack.

@dhpollack @fmassa Thank you so much. I just thought that it would be more efficient to leave the “Variable” step to the worker processes. It is not necessary if it takes almost no time.
I have another question now: does it make sense to copy the data to the GPU at the data loading step?

import torch


class ToTensor(object):
    """Convert ndarrays in sample to Tensors."""

    def __init__(self, phase_cuda=False):
        # If True, copy the tensors to the GPU right after conversion.
        self.phase_cuda = phase_cuda

    def __call__(self, sample):
        image, landmarks = sample['image'], sample['landmarks']

        # swap color axis because
        # numpy image: H x W x C
        # torch image: C x H x W
        image = torch.from_numpy(image.transpose((2, 0, 1)))
        landmarks = torch.from_numpy(landmarks)
        if self.phase_cuda:
            image, landmarks = image.cuda(), landmarks.cuda()
        return {'image': image, 'landmarks': landmarks}

If you are using multiple threads for data loading, that might not be necessary, but it depends on several factors.

Thank you. It didn’t work either: CUDA operations like cuda() fail in the multiprocess workers.
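
One possible workaround, sketched here under assumptions (PyTorch 0.4 or later, toy data, and a hypothetical helper named `batches_on_device`): keep the Dataset and its transforms on the CPU so the workers never touch CUDA, and move each batch to the GPU in the main process. The sketch falls back to the CPU when no GPU is available.

```python
import torch
import torch.utils.data as data

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def batches_on_device(loader, device):
    """Yield (inputs, labels) batches moved to `device` in the main process."""
    for inputs, labels in loader:
        # non_blocking only has an effect for pinned CPU memory copied to a GPU.
        yield inputs.to(device, non_blocking=True), labels.to(device, non_blocking=True)

# Hypothetical toy data: 5 samples with 2 features each, one label per sample.
X = torch.arange(0, 10).float().view(-1, 2)
Y = torch.zeros(5, 1)

loader = data.DataLoader(
    data.TensorDataset(X, Y),
    batch_size=2,
    num_workers=0,  # can be raised; workers would still only produce CPU tensors
    pin_memory=torch.cuda.is_available(),  # pinned memory helps host-to-GPU copies
)

moved = list(batches_on_device(loader, device))
```

Whether this is fast enough depends on the model and the data; it only shows where the copy can live so that cuda() is never called inside a worker.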