Help with multi-GPU and DataParallel for the transfer_learning_tutorial

I am trying to convert the example transfer_learning_tutorial.ipynb to multi-GPU; in my case, GPUs [0, 1].

I can handle the single-GPU case with:

> model = model.cuda(dev_id)
> inputs, labels = Variable(inputs).cuda(dev_id), Variable(labels).cuda(dev_id)
> outputs = model(inputs)
> _, preds = torch.max(outputs.data, 1)
> loss = criterion(outputs, labels)

Everything (model, inputs, labels) lives on GPU 0 or GPU 1, depending on dev_id.

Now I get very confused when using DataParallel.

I have some success when forcing everything onto one GPU:

dev_id = 0  # or 1
model = torch.nn.DataParallel(model, device_ids=[dev_id]).cuda(dev_id)
inputs, labels = Variable(inputs).cuda(dev_id), Variable(labels).cuda(dev_id)
outputs = model(inputs)
_, preds = torch.max(outputs.data, 1)
loss = criterion(outputs, labels)

But when I move to multiple GPUs, my understanding (and my success) falls apart:

model = torch.nn.DataParallel(model, device_ids=[0, 1]).cuda()
inputs, labels = Variable(inputs), Variable(labels).cuda(0)
outputs = model(inputs)
_, preds = torch.max(outputs.data, 1)
loss = criterion(outputs, labels)
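
For reference, this is a minimal, self-contained version of what I think the multi-GPU loop should look like, pieced together from the imagenet example. The resnet18 model and the dummy batch are placeholders of my own just so the snippet stands alone; please correct me if the explicit .cuda(0) on the inputs is wrong.

import torch
import torch.nn as nn
from torch.autograd import Variable
import torchvision.models as models

# placeholder model and loss, standing in for the tutorial's setup
model = models.resnet18(pretrained=True)
criterion = nn.CrossEntropyLoss()

# wrap for GPUs 0 and 1; the trailing .cuda() puts the parameters on GPU 0 (I think)
model = nn.DataParallel(model, device_ids=[0, 1]).cuda()

# dummy batch standing in for one iteration over the tutorial's dataloader
inputs = Variable(torch.randn(8, 3, 224, 224)).cuda(0)
labels = Variable(torch.arange(0, 8).long()).cuda(0)

outputs = model(inputs)            # DataParallel should scatter the batch across GPUs 0 and 1
_, preds = torch.max(outputs.data, 1)
loss = criterion(outputs, labels)  # outputs are gathered, so I assume they live on GPU 0
loss.backward()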

I’ve seen the imagenet example… but I still do not understand.
Why do I need the .cuda() after DataParallel on the model line?
How do I properly put my inputs on a GPU? Or do I need to at all? The examples do not call inputs.cuda().
Can I always expect (if multiple GPUs are available) my output to be on device_id=0? I try to check this with the snippet below.
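
For that last point, this is how I try to check where things end up after a forward pass (assuming outputs is the result of the DataParallel-wrapped model above):

print(outputs.data.get_device())                   # is this always 0, i.e. device_ids[0]?
print(next(model.parameters()).data.get_device())  # where the wrapped model's parameters live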

It seems so simple, but…