If this example is run as-is, I think it should work. Are you using a slightly different variant in practice? For example, are you moving your model to the GPU with .cuda() after constructing the optimizer (rather than the correct ordering of model.cuda(); optimizer = ...)?
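To make the ordering concrete, here is a minimal sketch of what I mean. The model (a small nn.Linear), the SGD optimizer, the learning rate, and the random input are placeholders I made up for illustration, not taken from your code. I also use .to(device) with a CPU fallback instead of a bare .cuda() so the snippet runs on a machine without a GPU.

```python
import torch
import torch.nn as nn

# Assumption: fall back to CPU when no GPU is present, so the sketch runs anywhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder model; substitute your own.
model = nn.Linear(4, 2)

# 1) Move the model to the target device first ...
model.to(device)

# 2) ... and only then construct the optimizer from its parameters.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One training step on random data, just to confirm everything lives on one device.
x = torch.randn(8, 4, device=device)
loss = model(x).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

If your real code constructs the optimizer before the move, swapping those two steps is the first thing I would try.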