I wanted to show that this:

```
w = torch.randn(2, 5)
```

is better than this:

```
w = torch.randn(2, 5, requires_grad=True)
```

because it does not include the initialization step in the gradient computation.

```
w = torch.randn(2, 5, requires_grad=True)
w.backward(retain_graph=True)
```

But my example to show grads failed with the error:

```
RuntimeError: grad can be implicitly created only for scalar outputs
```

How can I read the gradients of `w` initialization in the second case?

I don’t really understand what you are asking for.
Also, you can only call `backward()` on scalars; `w` is a tensor.

Gradients are not “initialized”; they are `None` until you backpropagate something.
However, nothing will be returned, as there are no gradients… (unless you backprop a scalar which involves `w`)
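To make this concrete, here is a minimal sketch (not from the thread) showing that `.grad` stays `None` until a scalar that involves `w` is backpropagated:

```python
import torch

w = torch.randn(2, 5, requires_grad=True)
print(w.grad)          # None: no backward pass has happened yet

loss = (w ** 2).sum()  # a scalar that involves w
loss.backward()
print(w.grad)          # now populated: d(sum(w^2))/dw = 2 * w
```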

Both

```
w = torch.randn(2, 5)
```

and

```
w = torch.randn(2, 5, requires_grad=True)
```

will run in PyTorch without any errors. I tried hard to make an example that shows the gradient of the jump from an uninitialized to a random value, but I am missing some important tip for checking this.

How can I do that? This looks like the way.

You can backpropagate `w.mean()`, `w.sum()`, or anything else that returns a scalar. Anyway, there is no such thing as a gradient of the initialization…
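For instance, a minimal sketch with `w.sum()` (hypothetical example, not from the thread): since the sum is a scalar, no gradient argument is needed, and the gradient of the sum with respect to every element is 1:

```python
import torch

w = torch.randn(2, 5, requires_grad=True)
w.sum().backward()  # sum() returns a scalar, so backward() needs no argument
print(w.grad)       # d(sum(w))/dw_i = 1 for every element
```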

Thanks @JuanFMontesinos, I had such a hard time making this test; you helped a lot.

Here is what I get when using `w.mean()`:

```
w = torch.randn(2, 5, requires_grad=True)
print(w)
r = w.mean()
r.backward(retain_graph=True)
print(w.grad)
# tensor([[ 0.6635, -0.3868,  2.0399,  0.3369, -0.5601],
#         [ 0.0737,  0.7711, -1.0142,  0.8241,  0.7964]], requires_grad=True)
# tensor([[0.1000, 0.1000, 0.1000, 0.1000, 0.1000],
#         [0.1000, 0.1000, 0.1000, 0.1000, 0.1000]])
```

I am afraid I cannot understand this result. Is this gradient accumulation?

No, you have the derivative of the mean with respect to each element: d((x_i + rest) / N)/dx_i = 1/N for each element, which equals 1/10 here since N = 10.
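A quick numerical check of that claim (a sketch, not from the thread): after backpropagating `w.mean()`, every entry of `w.grad` should equal `1 / w.numel()`:

```python
import torch

w = torch.randn(2, 5, requires_grad=True)
w.mean().backward()
# mean over N = 10 elements, so d(mean)/dw_i = 1/N = 0.1 everywhere
print(w.grad)
```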

What exactly would you like to show?
You can pass a gradient of the same shape as `w`, which would only set `w.grad` to this particular value:

```
w = torch.randn(2, 5, requires_grad=True)
w.backward(torch.ones_like(w))
> tensor([[1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1.]])
```

@ptrblck, this url contains the intriguing text: “For the weights, we set `requires_grad` after the initialization, since we don’t want that step included in the gradient.”, which is what I wanted to prove.

Maybe this is no longer the case.

I would recommend wrapping everything in a `with torch.no_grad()` block, which will not be tracked by autograd.
While manually setting `requires_grad_(True)` afterwards will most likely work, it’s a bit clearer to me to just use `torch.no_grad()`.
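As a sketch of that pattern (my own illustration, under the assumption that the weight is initialized in place): running the initialization inside `torch.no_grad()` keeps it out of the autograd graph while the tensor remains a leaf that requires grad:

```python
import math
import torch

# create the parameter first, then initialize it without tracking
w = torch.empty(784, 10, requires_grad=True)
with torch.no_grad():
    # in-place initialization; not recorded by autograd, so w stays a leaf
    w.normal_().div_(math.sqrt(784))

print(w.is_leaf, w.requires_grad)  # the init step was not tracked
```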

OK, this is something I will certainly do when I don’t plan to track gradients, but here I just wanted to check whether we can track the gradients of the random initialization:

```
w = torch.randn(2, 5, requires_grad=True)
```

Here is what you are trying to prove. Let T1 and T2 be two tensors created by the `torch.randn()` function with the same random seed, with the only difference between them being the moment at which `requires_grad` is set to `True`. That is,

```
seed = 42
# use the seed to create the first random tensor
torch.random.manual_seed(seed)
T1 = torch.randn(2, 5, requires_grad=True)
# use the same seed to create the second random tensor
torch.random.manual_seed(seed)
T2 = torch.randn(2, 5)
T2.requires_grad_(True)  # notice the in-place operation
```

Now, let us perform the exact same operations on both tensors `T1` and `T2`. Once we call the `backward()` method with some tensor of the same shape as argument (in this case I chose a tensor of all ones), both `T1` and `T2` should have the same value in their `grad` attribute.

```
# for T1
x1 = 3 * T1
y1 = x1 + 1
z1 = y1 * y1

# for T2
x2 = 3 * T2
y2 = x2 + 1
z2 = y2 * y2

# calling the backward method
z1.backward(torch.ones_like(z1))
z2.backward(torch.ones_like(z2))

# printing the .grad for T1 and T2
print(T1.grad)
print(T2.grad)
# tensor([[ 12.0604,   8.3186,  10.2203,  10.1460, -14.2114],
#         [  2.6461,  45.7476,  -5.4839,  14.3098,  10.8123]])
# tensor([[ 12.0604,   8.3186,  10.2203,  10.1460, -14.2114],
#         [  2.6461,  45.7476,  -5.4839,  14.3098,  10.8123]])
```

You get the same value for both.
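The equality can also be checked programmatically. A condensed sketch of the same experiment, ending in an exact comparison of the two gradients:

```python
import torch

seed = 42
torch.manual_seed(seed)
T1 = torch.randn(2, 5, requires_grad=True)

torch.manual_seed(seed)
T2 = torch.randn(2, 5)
T2.requires_grad_(True)

# identical computation on both tensors: z = (3*T + 1)^2
((3 * T1 + 1) ** 2).backward(torch.ones_like(T1))
((3 * T2 + 1) ** 2).backward(torch.ones_like(T2))

print(torch.equal(T1.grad, T2.grad))  # the gradients match exactly
```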

HOWEVER, the intriguing text inside the link you are referring to (url) aims at something different. Let me paste the code that originated the confusion:

```
weights = torch.randn(784, 10) / math.sqrt(784)
```

For this case, it does matter where `requires_grad` is set to `True`. If you try to set it on the first line, like this:

```
weights = torch.randn(784, 10, requires_grad=True) / math.sqrt(784)
```

then, after arbitrary operations are performed on `weights` and the `backward()` method is called, you will see a warning from PyTorch saying that you are trying to access the `grad` attribute of a non-leaf tensor, so `weights.grad` is `None`. Why? Because in that case `weights` does not meet the definition of a leaf tensor: a leaf variable is a variable that was not created by any operation tracked by the autograd engine (see this post for further examples). So, what is keeping `weights` from being a leaf variable? The division by `sqrt(784)`.
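A minimal demonstration of that non-leaf behaviour (my own sketch): the division produces a new tensor that autograd tracked, so `is_leaf` is `False` and `.grad` never gets populated on it:

```python
import math
import torch

# requires_grad is set before the division, so the result is NOT a leaf
weights = torch.randn(784, 10, requires_grad=True) / math.sqrt(784)
print(weights.is_leaf)   # the division created this tensor under autograd

weights.sum().backward()
# accessing .grad on a non-leaf emits a UserWarning and yields None
print(weights.grad)
```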