Should we add a transform ToVariable?

JindongJiang · September 6, 2017, 3:52am

We already have a ToTensor class that transforms a numpy-style image into a torch tensor. It seems that a ToVariable class could also be added to boost data loading performance by doing the conversion in the multiprocess data loading step. Does this idea make sense? Thanks.

from torch.autograd import Variable

class ToVariable(object):
    """Convert Tensors in sample to Variable."""

    def __call__(self, sample):
        return Variable(sample)

fmassa · September 6, 2017, 8:04pm

Converting a tensor to a Variable doesn’t incur any noticeable time penalty, so I don’t see why it would make things faster.
I think the best approach is to convert the tensors right after they are returned by the dataloader, so that there is only a single batched tensor to convert to a Variable.
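
A minimal sketch of the suggestion above, wrapping each batch in the training loop rather than in a transform. The toy tensors and the `batches` list are made up for illustration. Note that from PyTorch 0.4 onward `Variable` and `Tensor` are merged, so on recent versions the wrapping step is a no-op kept only for compatibility.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.autograd import Variable

# Toy data standing in for a real dataset (hypothetical shapes).
inputs = torch.arange(10, dtype=torch.float32).view(-1, 2)
labels = torch.zeros(5, 1)

# The loader collates plain tensors, which is what it expects.
loader = DataLoader(TensorDataset(inputs, labels), batch_size=2)

batches = []
for x, y in loader:
    # Wrap once per batch, in the main process, after collation.
    x, y = Variable(x), Variable(y)
    batches.append((x, y))
```

This way the conversion happens once per batch instead of once per sample, and the default collate function never sees a Variable.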

JindongJiang · September 7, 2017, 2:43am

Thanks. It doesn’t work either, since the data loader doesn’t recognize a Variable:

TypeError: batch must contain tensors, numbers, dicts or lists; found <class 'torch.autograd.variable.Variable'>

dhpollack · September 7, 2017, 9:05am

I think you wouldn’t want this because the transforms occur before the collate function in DataLoader. Instead, you might want to create a custom collate function. Below is a quick example:

import torch
import torch.utils.data as data
from torch.autograd import Variable

def variable_collate(batch):
    """Collate a batch of (input, label) pairs into two Variables.

    Args:
        batch: (list) of (input, label) tuples. In this simple example, I'm
            just assuming the inputs and labels are already Tensor types.
    Returns:
        minibatch: (Variable) stacked inputs
        targets: (Variable) stacked labels
    """
    # Split the list of pairs into a tuple of inputs and a tuple of labels.
    minibatch, targets = zip(*batch)
    minibatch, targets = torch.stack(minibatch, dim=0), torch.stack(targets, dim=0)
    return Variable(minibatch), Variable(targets)
    
X = torch.arange(0, 10).view(-1, 2)
Y = torch.zeros(5).view(-1, 1)

ds = data.TensorDataset(X, Y)
dl = data.DataLoader(ds, batch_size=1, collate_fn=variable_collate)

for mb, tgts in dl:
    print(mb, tgts)

fmassa · September 7, 2017, 3:36pm

Why do you want to return Variables in the Dataset? I would avoid having that pattern actually. But if you really want to, then you can provide your own collate_fn as pointed out by @dhpollack.

JindongJiang · September 7, 2017, 5:20pm

@dhpollack @fmassa Thank you so much. I just thought that it would be more efficient to leave the “Variable” step to multiple processes. It is not necessary if it takes almost no time.
I have another question now. Does it make sense to copy the data to the GPU at the data loading step?

import torch

class ToTensor(object):
    """Convert ndarrays in sample to Tensors."""

    def __init__(self, phase_cuda=False):
        self.phase_cuda = phase_cuda

    def __call__(self, sample):
        image, landmarks = sample['image'], sample['landmarks']

        # swap color axis because
        # numpy image: H x W x C
        # torch image: C x H x W
        image = torch.from_numpy(image.transpose((2, 0, 1)))
        landmarks = torch.from_numpy(landmarks)
        if self.phase_cuda:
            # optionally move both tensors to the GPU
            image, landmarks = image.cuda(), landmarks.cuda()
        return {'image': image, 'landmarks': landmarks}

fmassa · September 8, 2017, 8:09pm

If you are using multiple threads for data loading, that might not be necessary, but it depends on several factors.

JindongJiang · September 12, 2017, 1:39am

Thank you. It didn’t work either: CUDA operations like cuda() fail in a multiprocess worker.
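
One common alternative, sketched here under assumptions, is to keep the dataset and its transforms on the CPU and move each batch to the GPU in the main process after the loader returns it. The helper `batches_on_device` and the toy tensors are hypothetical names for illustration, and the `.to(device, non_blocking=True)` call assumes a recent PyTorch version (it is not the API that existed when this thread was written). The sketch falls back to the CPU when no GPU is available.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Use the GPU when one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy data standing in for a real dataset (hypothetical shapes).
inputs = torch.arange(10, dtype=torch.float32).view(-1, 2)
labels = torch.zeros(5, 1)


def batches_on_device(loader, device):
    """Yield (input, label) batches moved to `device` in the main process."""
    for x, y in loader:
        # non_blocking only has an effect for pinned CPU memory -> GPU copies.
        yield x.to(device, non_blocking=True), y.to(device, non_blocking=True)


loader = DataLoader(
    TensorDataset(inputs, labels),
    batch_size=2,
    num_workers=0,  # can be raised; workers would still return CPU tensors
    pin_memory=torch.cuda.is_available(),  # pinned memory helps host-to-GPU copies
)

moved = list(batches_on_device(loader, device))
```

With this layout no CUDA call runs inside a worker process, so the failure described above should not arise; whether it is faster than other arrangements depends on the workload.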