Data Parallel Operation Giving Error

Rafi_zunaed · March 8, 2020, 8:34pm

My code snippet is below:

model = … #somemodel
model = model.cuda()
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")
model = nn.DataParallel(model)

Then, for each batch from the dataloader:

images = images.cuda()
output = model(images)

The last line gives me this error:
Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

I am running this code on a GCP server with 2x T4 GPUs. Which part of the code above is wrong or needs to change to train on multiple GPUs?
[PyTorch version 1.1.0]

albanD · March 8, 2020, 9:05pm

Hi,

I think this is a known limitation of amp. You can see this issue on their repo for more info: https://github.com/NVIDIA/apex/issues/503
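
For intuition, here is a toy, pure-Python sketch of one way a device mismatch like this can arise. It rests on an assumption that is not confirmed in this thread: that the mixed-precision wrapper replaces forward on the model instance with a function bound to the original model, so a shallow-copied replica keeps calling the original's weights. ToyModel, patch_forward, and replicate are hypothetical stand-ins, not apex or PyTorch APIs.

```python
import copy


class ToyModel:
    """Stand-in for a module; `device` plays the role of the weights' GPU."""

    def __init__(self, device):
        self.device = device

    def forward(self, input_device):
        # Mimics the check in the error: input and weight must share a device.
        if input_device != self.device:
            raise RuntimeError(
                f"device {input_device} does not equal {self.device}")
        return f"ok on device {self.device}"


def patch_forward(model):
    # Hypothetical wrapper: captures the bound method of the ORIGINAL instance.
    old_forward = model.forward

    def new_forward(input_device):
        return old_forward(input_device)

    # Stored as an instance attribute, so it shadows the class method.
    model.forward = new_forward


def replicate(model, device):
    # Shallow copy standing in for a per-GPU replica; the patched
    # `forward` attribute is copied as-is and still points at `model`.
    replica = copy.copy(model)
    replica.device = device
    return replica


model = ToyModel(device=0)
patch_forward(model)
replica = replicate(model, device=1)

try:
    result = replica.forward(1)  # input lives on device 1, like the replica
except RuntimeError as err:
    result = str(err)
print(result)
```

In this toy, the replica on device 1 still runs the original model's forward with the weights on device 0, which reproduces the shape of the message above. Whether this is what actually happens inside amp is something the linked issue would have to confirm.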