Yeah, it looks like what's happening is that `x = Variable(torch.zeros(…), requires_grad=True).cuda()` first creates an intermediate Variable, effectively `y = Variable(torch.zeros(...), requires_grad=True)`, and then assigns `x = y.cuda()`. Since `y` is the leaf node, the gradients only accumulate in `y` and not in `x`.
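Here's a minimal sketch of the behavior, assuming a CUDA-capable machine. It uses the old `Variable` API to match the snippet above; the fix is to move the data to the GPU *before* wrapping it, so the CUDA tensor itself is the leaf:

```python
import torch
from torch.autograd import Variable

# Broken: x is NOT a leaf -- it's the output of the .cuda() op
# applied to a hidden CPU leaf, so gradients accumulate there.
x = Variable(torch.zeros(3), requires_grad=True).cuda()
x.sum().backward()
print(x.grad)  # None -- the gradient went to the hidden CPU leaf

# Fix: call .cuda() on the data first, so the GPU tensor is the leaf.
w = Variable(torch.zeros(3).cuda(), requires_grad=True)
w.sum().backward()
print(w.grad)  # tensor of ones, on the GPU
```

(In modern PyTorch you'd get the same leaf-on-GPU behavior with `torch.zeros(3, device='cuda', requires_grad=True)`.)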