How to utilize multiple GPUs for variable-sized images

When processing fixed-size images on multiple GPUs, the common approach is:

model = torch.nn.DataParallel(model).cuda()
output = model(input)  # input's shape is NCHW

In the case of variable-sized images, I use the following collate function in my DataLoader, which returns a list of 3D Variables. The forward of my model loops through the list and processes each image as a minibatch of size 1. The output size for each image is fixed, so the per-image outputs can be aggregated afterwards:

def list_collate(batch):
    return [item[0] for item in batch], [item[1] for item in batch]
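For context, here is a minimal sketch of the forward loop described above. The model and its layers are hypothetical stand-ins (a global average pool so any HxW collapses to a fixed-size feature), not my actual network:

```python
import torch
import torch.nn as nn

class VarSizeModel(nn.Module):
    """Toy model: pool each variable-sized image to a fixed-size
    feature vector, then apply a linear layer."""
    def __init__(self, in_channels=3, out_features=10):
        super().__init__()
        # AdaptiveAvgPool2d(1) yields a 1x1 spatial map regardless of HxW
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_channels, out_features)

    def forward(self, images):            # images: list of 3D CHW tensors
        outputs = []
        for img in images:
            x = img.unsqueeze(0)          # add batch dim -> 1xCxHxW
            x = self.pool(x).flatten(1)   # -> 1xC, fixed size
            outputs.append(self.fc(x))
        return torch.cat(outputs, dim=0)  # NxF, one row per input image

model = VarSizeModel()
batch = [torch.randn(3, 32, 48), torch.randn(3, 64, 64)]
out = model(batch)
print(tuple(out.shape))
```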

The problem is that only the first GPU is utilized during training (the memory usage of the other GPUs is much lower than that of the first). It seems that the scatter function in DataParallel duplicates the list of Variables instead of splitting it along dim 0. Is it possible to utilize all the GPUs with variable-sized inputs (and fixed-size outputs)?
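For illustration, one direction I have considered (unverified, names hypothetical) is overriding DataParallel.scatter so that the list itself is split into contiguous chunks across devices, mimicking what scatter does to a tensor's dim 0; the default gather should then concatenate the fixed-size outputs as usual:

```python
import torch
import torch.nn as nn

def chunk_list(items, num_chunks):
    """Split a list into contiguous, near-equal chunks (order preserved),
    mirroring how DataParallel splits a tensor along dim 0."""
    base, extra = divmod(len(items), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)
        if size:
            chunks.append(items[start:start + size])
        start += size
    return chunks

class ListDataParallel(nn.DataParallel):
    """Hypothetical workaround: scatter a list of variable-sized images by
    chunking the list across devices instead of recursing into each tensor."""
    def scatter(self, inputs, kwargs, device_ids):
        images = inputs[0]  # assumes forward(images) with a single list arg
        chunks = chunk_list(images, len(device_ids))
        scattered = [
            ([img.to(torch.device('cuda', dev)) for img in chunk],)
            for dev, chunk in zip(device_ids, chunks)
        ]
        return scattered, [{} for _ in scattered]

print(chunk_list(list(range(5)), 2))
```

I have not confirmed this is the intended way to handle list inputs, so corrections are welcome.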