Recommended way to move between CPU and GPU

Moving the model and data between the CPU and GPU with .to(device), .cpu(), or .cuda() can make the code messy if not done properly.
Are there any recommended guidelines or standard practices for this?

What about moving data to the GPU in the forward(self, ...) function of a network and moving the results back to the CPU before returning them?
Something along the lines of

    def forward(self, x):
        x = x.cuda()    # move the input to the GPU
        x = self.fc(x)
        x = x.cpu()     # move the output back to the CPU
        return x

Is this a good idea, or are there reasons not to go in this direction?
Thanks.

Moving or creating data inside forward via cuda() might yield errors if you are using nn.DataParallel or DistributedDataParallel, since the model will be copied to each specified device, while your cuda() call will move the data to the default device.
Of course you could specify the device by reading the device attribute of an internal parameter, e.g. via x = x.to(self.param.device), but I would generally push the data to the device in the DataLoader loop.
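
Something along these lines (a hypothetical sketch; the nn.Linear layer and its sizes are just placeholders):

    import torch
    import torch.nn as nn

    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(10, 2)

        def forward(self, x):
            # read the device from an internal parameter, so the input lands on
            # the same device as this replica's parameters (works under
            # nn.DataParallel, where each replica sits on a different device)
            x = x.to(self.fc.weight.device)
            return self.fc(x)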

I’m not sure if you need the result on the CPU, but note that this call will synchronize your code.
If you are training and thus need the output for the subsequent loss calculation, I would just leave the output on the device.
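
Putting both points together, a rough sketch of the pattern I mean, reusing the Net module from the sketch above (loader, the criterion, and the optimizer are placeholders for your actual setup):

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = Net().to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for data, target in loader:
        # push each batch to the device here instead of inside forward
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)              # the output stays on the device
        loss = criterion(output, target)  # so the loss is computed there as well
        loss.backward()
        optimizer.step()
        # .item() synchronizes and copies only a single scalar to the CPU
        print(loss.item())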


Fair enough, that makes sense. Thanks!
For my current situation I am playing around with Reinforcement Learning algorithms, and the inputs to the network can come from a variety of places, including the gym environment and my replay buffer.
Also, right now I am doing some numpy operations after the forward pass and before the loss calculation, so maybe changing these operations to run on torch tensors without grad, instead of using numpy, would be more elegant in this case.
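
For example, something like this hypothetical sketch (model, state, and the softmax/argmax calls are just stand-ins for my actual setup and post-processing):

    # instead of output.detach().cpu().numpy() followed by numpy ops
    # (which leaves the device and forces a sync), keep torch tensors
    output = model(state)  # forward pass; output stays on the device
    with torch.no_grad():
        probs = torch.softmax(output, dim=1)   # stand-in post-processing
        action = torch.argmax(probs, dim=1)
    # output is still attached to the graph and can be used for the loss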

I too need to run some numpy operations (specifically scipy operations) after the forward pass, before computing the loss. I did not understand “so maybe changing these operations to run on torch tensors without grad, instead of using numpy, would be more elegant in this case”. Please explain.