Grad is NoneType

Hello everyone,
I have a question about using torch to optimize my simple model.
It works when the tensors are on the CPU (no .cuda()).
When I move them to CUDA, it fails to compute the gradients.
I hope someone can help me solve this.
Thanks in advance.

import torch

N, D_in, H, D_out = 64, 1000, 100, 10

X = torch.randn(N, D_in).cuda()
Y = torch.randn(N, D_out).cuda()
W1 = torch.randn(D_in, H, requires_grad=True).cuda()
W2 = torch.randn(H, D_out, requires_grad=True).cuda()

learning_rate = 1e-6

for t in range(500):
    # forward propagation
    h = X.mm(W1)  # N * H
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(W2)  # N * D_out

    # loss function
    loss = (y_pred - Y).pow(2).sum()
    # print('y pred', y_pred)
    # backward propagation
    loss.backward()
    with torch.no_grad():
        # update weights
        W1 -= learning_rate * W1.grad
        W2 -= learning_rate * W2.grad
        W1.grad.zero_()
        W2.grad.zero_()

26 W1 -= learning_rate * W1.grad
27 W2 -= learning_rate * W2.grad
28 W1.grad.zero_()

TypeError: unsupported operand type(s) for *: 'float' and 'NoneType'


When you do W1 = torch.randn(D_in,H,requires_grad=True).cuda(), what is returned and stored as W1 is not a leaf anymore: it is the result of the differentiable op .cuda() applied to the leaf that torch.randn created.
You should do W1 = torch.randn(D_in,H, device="cuda", requires_grad=True) to make sure W1 is a leaf and thus will get a .grad field. The same applies to W2.
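For reference, here is a minimal sketch of the training loop with that fix applied. The `device` fallback to the CPU and the shortened loop of 5 steps are my own assumptions so that the snippet runs quickly on any machine; they are not part of the original question.

```python
import torch

# Assumption: fall back to the CPU so the sketch also runs without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

N, D_in, H, D_out = 64, 1000, 100, 10

X = torch.randn(N, D_in, device=device)
Y = torch.randn(N, D_out, device=device)
# Created directly on the target device, so both weights are leaf tensors.
W1 = torch.randn(D_in, H, device=device, requires_grad=True)
W2 = torch.randn(H, D_out, device=device, requires_grad=True)
print(W1.is_leaf, W2.is_leaf)  # True True

learning_rate = 1e-6
for t in range(5):  # shortened from 500 steps for illustration
    # forward propagation
    y_pred = X.mm(W1).clamp(min=0).mm(W2)
    loss = (y_pred - Y).pow(2).sum()
    # backward propagation fills W1.grad and W2.grad
    loss.backward()
    with torch.no_grad():
        W1 -= learning_rate * W1.grad
        W2 -= learning_rate * W2.grad
        # reset the accumulated gradients before the next step
        W1.grad.zero_()
        W2.grad.zero_()
```

Because W1 and W2 are leaves here, .grad is a Tensor after backward() and the update no longer hits the NoneType error.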

Wow, it really works! Thanks for the lightning-fast reply, I'm very thankful.
Best wishes

Hey guys, would you mind giving me some detail about the "differentiable op .cuda()", or some links that would help me understand the principle behind it?
I would be grateful.

It is the same as if you’re doing:

a = torch.rand(10, requires_grad=True)
b = a + 1

Then b.grad will be None, because b is not a leaf: only a gets a .grad after backward().
The same thing happens if you replace the + 1 op with .cuda().
It is handled like any other differentiable operation on a Tensor.
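A small sketch that makes this visible with is_leaf and grad_fn. As an assumption so that it runs without a GPU, it uses a dtype conversion with .to(torch.float64) as a stand-in for .cuda(); both are copies that autograd records as differentiable ops.

```python
import warnings

import torch

a = torch.rand(10, requires_grad=True)  # created by the user: a leaf
b = a + 1                               # result of an op: not a leaf
c = a.to(torch.float64)                 # stand-in for .cuda(): also an op result

print(a.is_leaf, b.is_leaf, c.is_leaf)  # True False False
print(a.grad_fn is None)                # True, leaves have no grad_fn
print(b.grad_fn is not None)            # True, b remembers the op that made it

b.sum().backward()
print(a.grad is not None)               # True, the leaf received a gradient

# Reading .grad on a non-leaf tensor emits a UserWarning, silenced here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    b_grad = b.grad
print(b_grad is None)                   # True
```

If you really need the gradient of an intermediate tensor like b, calling b.retain_grad() before backward() asks autograd to keep it.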

Wow, interesting, now I've got it. Thanks for your reply, it was really easy to understand.
Thank you again!!