In the code above, I create the mllama model on the CPU, use its parameters to create the optimizer, and then move the model to CUDA.
Is this approach feasible?
Will there be any problems with updating the model parameters?
Parameters are moved in place, so creating the optimizer before moving the model should be fine.
This simple code snippet also shows it:
import torch

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.Adam(model.parameters())

# The optimizer references the model's parameters, which are still on the CPU
print([p.device for p in optimizer.param_groups[0]['params']])
# [device(type='cpu'), device(type='cpu')]

model.cuda()

# The same references now point to the moved (CUDA) parameters
print([p.device for p in optimizer.param_groups[0]['params']])
# [device(type='cuda', index=0), device(type='cuda', index=0)]
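To see that the move really happens in place, you can also check that the optimizer keeps the very same Parameter objects across the move. The sketch below uses .double() instead of .cuda() only so it runs on a CPU-only machine; .cuda() goes through the same in-place mechanism:

```python
import torch

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.Adam(model.parameters())
before = list(optimizer.param_groups[0]['params'])

# .double() is an in-place conversion just like .cuda(): each
# Parameter keeps its identity, only its underlying data is replaced.
model.double()
after = optimizer.param_groups[0]['params']

same_objects = all(a is b for a, b in zip(before, after))
print(same_objects)              # True
print([p.dtype for p in after])  # [torch.float64, torch.float64]
```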
Did you see any issues or is this a general question?
Generally, I would still recommend finishing the actual model setup before creating the optimizer, especially if you are applying more advanced model sharding approaches.
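The reason is that sharding wrappers such as FSDP typically *replace* the module's parameters (e.g. with flattened ones) rather than moving them in place, so an optimizer created beforehand would keep references to the old, dead tensors. The minimal sketch below simulates that replacement by reassigning a parameter by hand (this is an illustration of the failure mode, not actual FSDP code):

```python
import torch

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Simulate a wrapper that swaps in a brand-new Parameter object
# (as FSDP's flat parameters do) instead of moving the old one in place.
model.weight = torch.nn.Parameter(model.weight.detach().clone())

# The optimizer still holds the old weight, not model.weight
stale = optimizer.param_groups[0]['params'][0] is not model.weight
print(stale)  # True: the optimizer no longer tracks model.weight
```

This is why the safe order is: finish all model setup (device placement, wrapping), then create the optimizer from the final parameters.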
This problem typically occurs when using FSDP or DeepSpeed to accelerate large-model training. With the transformers library, it is easy to create the optimizer from an optimizer config. However, when using our own custom training library, we usually pass the model, optimizer, and lr_scheduler directly to the SFT interface. In that case the model is processed afterwards, e.g. moved to distributed GPUs and wrapped with FSDP or DeepSpeed. So we would like to keep the model on the CPU and use its parameters to create the optimizer.