Hello everyone,
I have a question about using torch to optimize a simple model.
The code below works on CPU (no .cuda()), but when I move the tensors to CUDA the computation fails.
I hope someone can help me solve this.
Thanks in advance.

import torch

N, D_in, H, D_out = 64, 1000, 100, 10

X = torch.randn(N, D_in).cuda()
Y = torch.randn(N, D_out).cuda()
W1 = torch.randn(D_in, H, requires_grad=True).cuda()
W2 = torch.randn(H, D_out, requires_grad=True).cuda()

learning_rate = 1e-6

for t in range(500):
    # forward propagation
    h = X.mm(W1)             # N * H
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(W2)   # N * D_out

    # loss function
    loss = (y_pred - Y).pow(2).sum()
    # print('y', Y)
    # print('y pred', y_pred)
    print(loss.item())

    # backward propagation
    loss.backward()
    with torch.no_grad():
        # update weights
        W1 -= learning_rate * W1.grad
        W2 -= learning_rate * W2.grad
        W1.grad.zero_()
        W2.grad.zero_()

When you do W1 = torch.randn(D_in, H, requires_grad=True).cuda(), what is returned and stored in W1 is no longer a leaf tensor: it is the result of the differentiable op .cuda(). By default, autograd only populates the .grad field of leaf tensors, so W1.grad stays None and the weight update fails.
You should do W1 = torch.randn(D_in, H, device="cuda", requires_grad=True) to make sure W1 is a leaf and thus gets a .grad field.
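Putting that fix into the full training loop, a corrected sketch could look like the following. The device fallback to CPU is my addition so the snippet also runs on a machine without a GPU; everything else follows the original code.

```python
import torch

# pick GPU if available, otherwise fall back to CPU (added for portability)
device = "cuda" if torch.cuda.is_available() else "cpu"

N, D_in, H, D_out = 64, 1000, 100, 10

X = torch.randn(N, D_in, device=device)
Y = torch.randn(N, D_out, device=device)
# created directly on the target device, so W1/W2 remain leaf
# tensors and autograd will populate their .grad fields
W1 = torch.randn(D_in, H, device=device, requires_grad=True)
W2 = torch.randn(H, D_out, device=device, requires_grad=True)

learning_rate = 1e-6

for t in range(500):
    # forward propagation
    h = X.mm(W1)             # N * H
    h_relu = h.clamp(min=0)  # ReLU
    y_pred = h_relu.mm(W2)   # N * D_out

    # sum-of-squares loss
    loss = (y_pred - Y).pow(2).sum()

    # backward propagation
    loss.backward()
    with torch.no_grad():
        # update weights in place, then clear the accumulated gradients
        W1 -= learning_rate * W1.grad
        W2 -= learning_rate * W2.grad
        W1.grad.zero_()
        W2.grad.zero_()
```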

Hey guys, would you mind giving me some detail about the "differentiable op .cuda()", or some links to help me understand the underlying principle?
I would be grateful.

a = torch.rand(10, requires_grad=True)
b = a + 1
b.sum().backward()

Then b.grad will be None, because b is not a leaf tensor.
The same thing happens if you replace the + 1 op with .cuda(): it is handled like any other differentiable operation on a Tensor.