Model.to(device) vs. tensor.to(device)

I recently ran into a discussion about the difference between my_model.to(device) and my_tensor.to(device): while the model is moved to the device in place, a tensor must be reassigned to be moved (my_tensor = my_tensor.to(device)).
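For illustration, a minimal example of the asymmetry (the printed devices depend on whether a GPU is available):

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2)
model.to(device)                        # modifies the module in place
print(next(model.parameters()).device)  # on `device`

t = torch.randn(4)
t.to(device)                            # returns a NEW tensor; `t` is unchanged
print(t.device)                         # still cpu
t = t.to(device)                        # reassignment is required
print(t.device)                         # now on `device`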

I have searched this forum for similar topics explaining this behavior. There are a lot of discussions on the specific usage of .to(device), but none that explicitly addresses why PyTorch implements this method differently for models vs. tensors.

From my point of view, the issue with this logic is that using my_model.to(device) together with my_tensor.to(device) causes a device mismatch without the problem being directly visible: the model call works, while the tensor call silently returns a new tensor that is discarded (see the reproduction below). As a developer, you need to know which objects in the code are tensors and which are modules (especially hard for PyTorch beginners). The logic is also inconsistent with other in-place methods, e.g. .add_(), which carry a trailing underscore _ specifically to mark in-place behavior.
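Assuming a CUDA device is available, the forward pass fails because the model's parameters were moved to the GPU while the input silently stayed on the CPU:

import torch
import torch.nn as nn

device = torch.device("cuda")  # assumption: a GPU is available

model = nn.Linear(4, 2)
model.to(device)               # moves the module in place
t = torch.randn(1, 4)
t.to(device)                   # no-op for `t`: the returned tensor is discarded

out = model(t)                 # raises a device-mismatch RuntimeError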

Is there any particular logic or design philosophy behind this decision to have different behaviors for the same method name? To me it seems counterintuitive that .to(device) behaves differently depending on the object type. Would it make sense to make the behavior of .to(device) consistent for models and tensors, for example by having both my_model.to(device) and my_tensor.to(device) operate in place?

To avoid the inconsistency, you can make sure to always assign the return value of .to() calls:

model = model.to(device)
tensor = tensor.to(device)

This pattern works for models as well, since Module.to() returns self; the assignment is then a no-op but keeps the code uniform.

I see the in-place definition w.r.t. the actual data and memory, i.e. I expect an in-place operation to work on the already allocated memory and perform every operation on the values stored there. Moving data from one device to another thus cannot be an in-place operation, unless you abstract away the memory model and use something like managed memory (which will migrate memory behind your back to the device).
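To make this concrete, a small sketch: a true in-place op like add_() keeps the tensor's storage pointer, while a device transfer has to allocate new memory on the target.

import torch

t = torch.randn(3)
ptr = t.data_ptr()

t.add_(1.0)                        # true in-place op: writes into the existing storage
assert t.data_ptr() == ptr         # same allocation

if torch.cuda.is_available():
    t_gpu = t.to("cuda")           # new allocation on the GPU; `t` keeps its CPU storage
    print(t.device, t_gpu.device)  # cpu cuda:0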

My guess is the design was chosen for the sake of simplicity (iterate recursively through all registered modules, parameters, and buffers, and move them to the device) and to avoid a high memory footprint; a naive alternative would be to clone the model (2x memory usage) before moving the data to the device. A rough sketch of such a traversal is below.
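This is a hypothetical sketch of that traversal, not the real implementation (the actual logic lives in nn.Module._apply() and covers more cases, e.g. gradients and shared parameters); each tensor is moved one at a time, so the full model is never duplicated:

import torch
import torch.nn as nn

def move_module_(module: nn.Module, device: torch.device) -> nn.Module:
    for child in module.children():
        move_module_(child, device)              # recurse into registered submodules
    for param in module.parameters(recurse=False):
        param.data = param.data.to(device)       # swap the parameter's underlying tensor
        if param.grad is not None:
            param.grad.data = param.grad.data.to(device)
    for name, buf in module.named_buffers(recurse=False):
        setattr(module, name, buf.to(device))    # __setattr__ routes this into module._buffers
    return module                                # returned for chaining, like the real .to()

model = nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8))
move_module_(model, torch.device("cpu"))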


Got it — thank you so much for your thorough and helpful response!
