Experiment: Read the gradients of random initialization

I wanted to show that this is better:

w = torch.randn(2,5)
w.requires_grad_()

instead of

w = torch.randn(2, 5, requires_grad=True)

so that the initialization step is not included in the gradients.

w = torch.randn(2, 5, requires_grad=True)
w.backward(retain_graph=True)  # this call fails: w is not a scalar
print(w.grad)

But my example to show grads failed with the error:

RuntimeError: grad can be implicitly created only for scalar outputs

How can I read the gradients of w initialization in the second case?

I don’t really understand what you are asking for.
In addition, you can only call backward on scalars; w is a tensor.

Gradients are not “initialized”; they are None until you backpropagate something.
You can access the gradients via w.grad.
However, nothing will be returned as there are no gradients… (unless you backprop a scalar which involves w).
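
For example, a minimal check (the random values will differ on each run):

import torch

w = torch.randn(2, 5, requires_grad=True)
print(w.grad)          # None -- no backward pass has run yet
loss = (w ** 2).sum()  # any scalar that depends on w
loss.backward()
print(w.grad)          # now populated with d(loss)/dw = 2 * w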

Both

w = torch.randn(2,5)
w.requires_grad_()

and

w = torch.randn(2, 5, requires_grad=True)

will work in PyTorch without any errors. I tried hard to make this example show the gradient of the jump from an uninitialized value to a random value, but I am missing some important tip on how to check this.

How can I do that? This looks like the way.

You can backpropagate w.mean(), w.sum() or whatever else returns a scalar. Anyway, there is no such thing as a gradient of the initialization…

Thanks @JuanFMontesinos, I had such a hard time making this test; you helped.

Here is what I get when using w.mean():

w = torch.randn(2, 5, requires_grad=True)
print(w)
r = w.mean()
r.backward(retain_graph=True)
print(w.grad)
# tensor([[ 0.6635, -0.3868,  2.0399,  0.3369, -0.5601],
#         [ 0.0737,  0.7711, -1.0142,  0.8241,  0.7964]], requires_grad=True)
# tensor([[0.1000, 0.1000, 0.1000, 0.1000, 0.1000],
#         [0.1000, 0.1000, 0.1000, 0.1000, 0.1000]])

I am afraid I don’t understand this result. Is this gradient accumulation?

You have the derivative of the mean w.r.t. each element: d((x_1 + … + x_N)/N)/dx_i = 1/N, which equals 1/10 here since N = 10.
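
To make the 1/10 concrete, here is a minimal comparison of mean() and sum() (the random values differ per run, but the gradients are deterministic):

import torch

w = torch.randn(2, 5, requires_grad=True)
w.mean().backward()
print(w.grad)      # all entries are 0.1, since d(mean)/dw_i = 1/N = 1/10

w.grad = None      # reset, otherwise the next backward would accumulate on top
w.sum().backward()
print(w.grad)      # all entries are 1.0, since d(sum)/dw_i = 1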

What exactly would you like to show?
You can pass a gradient of the same shape as w, which would only set w.grad to this particular value:

w = torch.randn(2, 5, requires_grad=True)
w.backward(torch.ones_like(w))
print(w.grad)
> tensor([[1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1.]])

@ptrblck, this url has the intriguing text: " For the weights, we set requires_grad after the initialization, since we don’t want that step included in the gradient." which I wanted to prove.

Maybe this is no longer the case.

I would recommend wrapping everything in a with torch.no_grad() block, so that it is not tracked by Autograd.
While the approach of manually setting requires_grad_(True) afterwards will most likely work, it’s a bit clearer to me to just use torch.no_grad().
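
For example, something along these lines (a minimal sketch; the division is just a stand-in for whatever initialization math you don’t want tracked):

import torch

with torch.no_grad():
    w = torch.randn(2, 5) / 5  # init math, not recorded by Autograd
w.requires_grad_()
print(w.is_leaf, w.requires_grad)  # True True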

OK, this is something I will certainly do when I don’t plan to track autograd, but in here, I just wanted to check if we can track gradients of random initialization:

w = torch.randn(2, 5, requires_grad=True)

Here is what you are trying to prove. Let T1 and T2 be two tensors created by the torch.randn() function with the same random seed, with the only difference between them being the moment at which requires_grad is set to True. That is,

seed = 42
# use seed to create the first random tensor
torch.random.manual_seed(seed)
T1 = torch.randn(2,5, requires_grad = True)
# use the same seed to create the second random tensor
torch.random.manual_seed(seed)
T2 = torch.randn(2,5)
T2.requires_grad_(True) # notice the in-place operation

Now, let us perform the exact same operations on both tensors T1 and T2. In this way, once we call the backward() method with a tensor of the same shape as input (in this case I chose a tensor of all ones), both tensors T1 and T2 should have the same value in their grad attribute.

# for T1
x1 = 3 * T1 
y1 = x1 + 1
z1 = y1 * y1

# for T2
x2 = 3 * T2 
y2 = x2 + 1
z2 = y2 * y2

# calling the backward method
z1.backward(torch.ones_like(z1))
z2.backward(torch.ones_like(z2))

# printing the .grad for T1 and T2
print(T1.grad)
print(T2.grad)
#tensor([[ 12.0604,   8.3186,  10.2203,  10.1460, -14.2114],
#        [  2.6461,  45.7476,  -5.4839,  14.3098,  10.8123]])
#tensor([[ 12.0604,   8.3186,  10.2203,  10.1460, -14.2114],
#        [  2.6461,  45.7476,  -5.4839,  14.3098,  10.8123]])

You get the same value for both.

HOWEVER, the intriguing text inside the link you are referring to (url) aims at something different. Let me paste the code that originated the confusion,

weights = torch.randn(784, 10) / math.sqrt(784)
weights.requires_grad_()

For this case, it does matter where requires_grad is set to True. If you try to set it at the first line, like this

weights = torch.randn(784, 10, requires_grad=True) / math.sqrt(784)

after arbitrary operations are performed on weights and the backward() method is called, you will see a warning from PyTorch saying that you are trying to access the grad attribute of a non-leaf tensor, so weights.grad is set to None. Why? Because in that case, weights does not satisfy the definition of a leaf tensor: a leaf Variable is a Variable that was not created by any operation tracked by the autograd engine (see this post for further examples). So, what is keeping weights from being a leaf variable? The division by math.sqrt(784).
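
A quick way to check this yourself (a minimal sketch; is_leaf is the attribute to look at, and the exact warning text can vary across PyTorch versions):

import math
import torch

# requires_grad set at creation: the division is tracked, so weights is NOT a leaf
weights = torch.randn(784, 10, requires_grad=True) / math.sqrt(784)
print(weights.is_leaf)     # False
weights.sum().backward()
print(weights.grad)        # None, plus a UserWarning about accessing .grad of a non-leaf tensor

# requires_grad_ set after the init math: weights IS a leaf
weights = torch.randn(784, 10) / math.sqrt(784)
weights.requires_grad_()
print(weights.is_leaf)     # True
weights.sum().backward()
print(weights.grad.shape)  # torch.Size([784, 10])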

Try it yourself and let me know!
