Should it really be necessary to do var.detach().cpu().numpy()?

I have a CUDA variable that is part of a differentiable computational graph. I want to read out its value into numpy (say for plotting).

If I do var.numpy() I get RuntimeError: Can’t call numpy() on Variable that requires grad. Use var.detach().numpy() instead.

Ok, so I do var.detach().numpy() and get TypeError: can’t convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first

Ok, so I go var.detach().cpu().numpy() and it works.
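
For anyone who wants to reproduce the sequence above, here is a minimal sketch. The device fallback and the way var is built are assumptions for illustration; on a machine without CUDA only the first error appears, because the tensor is already on the CPU.

```python
import torch

# Assumption: fall back to the CPU when no GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"

# var is the result of an op on a tensor that requires grad,
# so it is part of a differentiable graph.
var = torch.ones(3, device=device, requires_grad=True) * 2

try:
    var.numpy()              # refused because var requires grad
except RuntimeError as err:
    print("RuntimeError:", err)

try:
    var.detach().numpy()     # refused only when var lives on a CUDA device
except TypeError as err:
    print("TypeError:", err)

arr = var.detach().cpu().numpy()   # works for both CPU and CUDA tensors
print(arr)
```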

My question is: Is there any good reason why this isn’t just done within the numpy() method itself? It’s cumbersome and litters the code to have all these *.detach().cpu().numpy()'s sitting all around.


I have the same question. When a user calls numpy() on a variable, I think they presumably also want that variable on the CPU and detached.
I don’t know how the PyTorch developers see it, but I think there should be a function that returns the underlying values of a tensor.
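
In the meantime, something like the following user-side helper might cover this. Note that to_numpy is a hypothetical name, not a PyTorch function; it only wraps the chain from the first post.

```python
import torch

def to_numpy(t):
    """Hypothetical helper (not part of PyTorch): return the values of t
    as a NumPy array, whether or not t requires grad or lives on CUDA."""
    return t.detach().cpu().numpy()

x = torch.ones(2, requires_grad=True) * 3
print(to_numpy(x))
```

Keep in mind that for a CPU tensor the returned array still shares memory with the tensor, so writing into it also changes the tensor.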


I think the main reason behind this choice is to avoid confusing newcomers. People who are not very familiar with requires_grad and CPU/GPU tensors might go back and forth through numpy, for example doing pytorch -> numpy -> pytorch and then calling backward on the last Tensor. This will run backward without raising any error, but the gradients will not flow all the way back to the first part of the code.
So the choice was made to force the user to call detach(), to make sure they really want to do it and that it is not a typo or another library that performs this transformation and breaks the computational graph.
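
A small sketch of that silent failure, with made-up tensor names (x, w, y, z) chosen only for illustration:

```python
import torch

x = torch.ones(3, requires_grad=True)            # "first part of the code"
w = torch.full((3,), 2.0, requires_grad=True)    # a later parameter

y = x * 3                       # tracked by autograd
y_np = y.detach().numpy()       # leaves PyTorch: the graph is cut here
z = torch.from_numpy(y_np)      # back in PyTorch, but with no history
loss = (z * w).sum()

loss.backward()                 # runs without any error
print(w.grad)                   # populated, since w comes after the round trip
print(x.grad)                   # None: the gradient never reached x
```

Nothing warns you that x received no gradient, which is why the explicit detach() is there as a checkpoint.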


Fair enough - but could we at least get rid of the need for X.cpu().numpy()? Seems X.numpy() alone should be enough.

The reason for requiring an explicit .cpu() is that CPU tensors and the numpy arrays converted from them share memory. If .cpu() were done implicitly, numpy() would behave differently for CUDA and CPU tensors (a copy in one case, shared memory in the other), and we wanted to be explicit to avoid bugs.
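
The sharing is easy to see on a CPU tensor. In this sketch the clone() call stands in for what an implicit .cpu() on a CUDA tensor would amount to, namely a copy; the variable names are just for illustration.

```python
import torch

t = torch.zeros(3)         # CPU tensor
a = t.numpy()              # no copy: a and t use the same buffer
a[0] = 7.0                 # write through the NumPy array
print(t)                   # the tensor sees the change

c = t.clone().numpy()      # an explicit copy behaves differently
c[1] = 5.0
print(t)                   # t[1] is unchanged
```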


The explicitness of PyTorch is most of what I’m enjoying relative to TensorFlow 2.

As a quick follow-up to this question, is there any difference between var.detach().cpu() and var.cpu().detach()?


If var requires gradient, then var.cpu().detach() constructs the autograd edge for .cpu(), which is destroyed right away because the intermediate result is not stored. var.detach().cpu() does not do this. However, this is very fast, so in practice they are the same.
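
A quick check that both orders give the same result, as a sketch (the device fallback is an assumption; on a CPU-only machine .cpu() has nothing to move, so the two lines are trivially equivalent):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
var = torch.randn(4, device=device, requires_grad=True) * 2

a = var.detach().cpu()   # detach first: the copy is never recorded by autograd
b = var.cpu().detach()   # copy first: recorded when var is on CUDA, then dropped

print(a.requires_grad, b.requires_grad)   # neither result requires grad
print(a.grad_fn, b.grad_fn)               # neither result has a grad_fn
print(torch.equal(a, b))                  # same values either way
```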


Ok but why wouldn’t it do backward “all the way to the first part of the code”?


Because we are only able to provide gradients for pytorch ops, not other ops (like numpy or other libraries).


I’m a little confused. var.clone() keeps both tensors in the computation graph, and var.cpu() returns a copy of the tensor on the CPU instead of on CUDA, right? So will var and var.cpu() both remain in the computation graph? Will they share memory?

var (on CUDA) and var.cpu() are on different devices, so they cannot share memory. So var and var.cpu() will act like var and var.clone(), except that they are on different devices?

And if I use var and var.cpu().cuda(), will it act like var and var.clone()?


Yes, if var was originally on the GPU.
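
To illustrate, here is a sketch showing that a device copy stays in the graph just like a clone does. The names moved and cloned are made up for this example, and on a CPU-only machine var.cpu() simply returns var itself rather than a copy.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
var = torch.ones(3, device=device, requires_grad=True)

moved = var.cpu()       # differentiable copy when var is on CUDA
cloned = var.clone()    # differentiable copy on the same device

# Use both branches in one loss; gradients flow back through each of them.
loss = moved.sum() + cloned.cpu().sum()
loss.backward()
print(var.grad)         # both branches contribute to var.grad
```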


