gpus = [0, 1]
model_1 = DataParallel(model_1, device_ids=gpus) # e.g. a ResNet without its classifier
model_2 = DataParallel(model_2, device_ids=gpus) # a classifier containing Dropout and Linear
opt.zero_grad() # SGD optimizer
feat = model_1(x)
print("forward model_1")
prob = model_2(feat)
print("forward model_2")
loss = CrossEntropyLoss()(prob, y) # CrossEntropyLoss is a module and must be instantiated before being called
print("forward loss")
loss.backward()
print("backward")
opt.step()
This program works fine on my toy dataset, which has only 200+ classes.
But when I use my large dataset, which has 2 million classes, it seems to get stuck at the backward step.
You should combine model_1 and model_2 into a single module, which can then be wrapped in a single DataParallel object. Depending on how you have instantiated your opt, your current approach might not be working as intended, since opt needs to be initialised with the model's parameters. That is why you should write it as follows:
model = full_model(...)
opt = torch.optim.Adam(model.parameters(), lr=lr)
model_dp = DataParallel(model, device_ids=gpus)
This way your parameters will be correctly updated when calling opt.step(), since they likely weren't before.
Initialising the optimizer after wrapping the model in DataParallel will probably still work, but personally I think it's better practice to initialise the optimizer on the base model itself, i.e., using model.parameters() in the optim constructor rather than DataParallel(model).parameters().
Just make sure that your optim is initialised with the full set of model parameters, hence the suggestion to wrap model_1 and model_2 into a single model.
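To make this concrete, here is a minimal sketch of how the two sub-models could be combined into one module before wrapping in DataParallel. The FullModel class and the toy backbone/classifier stand-ins are hypothetical, just to make the snippet self-contained; you would substitute your real model_1 and model_2.

```python
import torch
from torch import nn
from torch.nn import DataParallel

# Hypothetical wrapper combining the two sub-models from the question.
class FullModel(nn.Module):
    def __init__(self, backbone, classifier):
        super().__init__()
        self.backbone = backbone      # e.g. a ResNet without its classifier head
        self.classifier = classifier  # e.g. Dropout + Linear

    def forward(self, x):
        return self.classifier(self.backbone(x))

# Toy stand-ins so the sketch runs anywhere; replace with your real modules.
backbone = nn.Linear(16, 8)
classifier = nn.Sequential(nn.Dropout(0.5), nn.Linear(8, 4))

model = FullModel(backbone, classifier)
# Optimizer built on the base model's parameters, before any DataParallel wrapping.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
model_dp = DataParallel(model) if torch.cuda.is_available() else model

# One training step.
x = torch.randn(2, 16)
y = torch.tensor([0, 1])
criterion = nn.CrossEntropyLoss()

opt.zero_grad()
loss = criterion(model_dp(x), y)
loss.backward()
opt.step()
```

Because model_dp merely wraps model, the parameters the optimizer holds are the same tensors DataParallel updates gradients for, so opt.step() applies the update correctly.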