which can then be wrapped in a single DataParallel object. Depending on how you instantiated your opt, your current approach might not work as intended, since the optimizer needs to be initialised with the model's parameters. You should therefore write it as follows:
```python
model = full_model(...)
opt = torch.optim.Adam(model.parameters(), lr=lr)
model_dp = DataParallel(model, device_ids=gpus)
```
This way your parameters will be correctly updated when calling opt.step(), which they likely weren't before.
I think it will still work if you initialise the optimizer after wrapping the model in DataParallel, but personally I think it's better practice to initialise the optimizer on the base model itself, i.e., using model.parameters() in the optim constructor rather than DataParallel(model).parameters().
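The reason both work is that DataParallel registers the wrapped model as a submodule rather than copying its tensors, so both handles expose the very same parameter objects. A minimal sketch to check this (using a stand-in nn.Linear as the model, since the actual model isn't shown here):

```python
import torch
import torch.nn as nn

# Stand-in base model just for illustration
model = nn.Linear(4, 2)
model_dp = nn.DataParallel(model)

# Both yield the very same parameter tensors, so an optimizer
# built from either handle will update the base model.
same = all(
    p_dp is p
    for p_dp, p in zip(model_dp.parameters(), model.parameters())
)
print(same)  # True
```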
Just make sure that your optimizer is initialised with the full set of model parameters, hence the suggestion to wrap model_1 and model_2 into a single model.
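One straightforward way to combine them is a container nn.Module. This is a sketch, not your exact setup: model_1 and model_2 are stand-in nn.Linear layers here, and the forward pass assumes model_2 consumes model_1's output, which you'd adapt to however your two models actually interact:

```python
import torch
import torch.nn as nn

class FullModel(nn.Module):
    """Container that registers both sub-models as submodules."""
    def __init__(self, model_1, model_2):
        super().__init__()
        self.model_1 = model_1
        self.model_2 = model_2

    def forward(self, x):
        # Assumed composition: model_2 consumes model_1's output.
        return self.model_2(self.model_1(x))

# Stand-in sub-models for illustration
model = FullModel(nn.Linear(8, 4), nn.Linear(4, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # sees ALL parameters
model_dp = nn.DataParallel(model)
```

Because FullModel registers both sub-models as submodules, model.parameters() yields the parameters of both, so a single opt.step() updates everything.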