Experiment: Read the gradients of random initialization

Here is what you are trying to prove. Let T1 and T2 be two tensors created by the torch.randn() function with the same random seed, the only difference between them being the moment at which requires_grad is set to True. That is,

import torch

seed = 42
# use seed to create the first random tensor
torch.random.manual_seed(seed)
T1 = torch.randn(2, 5, requires_grad=True)
# use the same seed to create the second random tensor
torch.random.manual_seed(seed)
T2 = torch.randn(2, 5)
T2.requires_grad_(True)  # notice the in-place operation

Now, let us perform the exact same operations on both tensors T1 and T2. Once we call the backward() method, passing a tensor of the same shape as the output as the gradient argument (here I chose a tensor of all ones), both T1 and T2 should have the same value in their grad attribute.

# for T1
x1 = 3 * T1 
y1 = x1 + 1
z1 = y1 * y1

# for T2
x2 = 3 * T2 
y2 = x2 + 1
z2 = y2 * y2

# calling the backward method
z1.backward(torch.ones_like(z1))
z2.backward(torch.ones_like(z2))

# printing the .grad for T1 and T2
print(T1.grad)
print(T2.grad)
#tensor([[ 12.0604,   8.3186,  10.2203,  10.1460, -14.2114],
#        [  2.6461,  45.7476,  -5.4839,  14.3098,  10.8123]])
#tensor([[ 12.0604,   8.3186,  10.2203,  10.1460, -14.2114],
#        [  2.6461,  45.7476,  -5.4839,  14.3098,  10.8123]])

You get the same value for both.
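As an extra sanity check (this snippet is not part of the original question), you can compare the two .grad tensors directly and also against the analytic gradient: with z = (3*T + 1)**2, dz/dT = 6*(3*T + 1) = 18*T + 6.

# extra check: the grads are identical and match the analytic gradient 18*T + 6
print(torch.equal(T1.grad, T2.grad))                       # True
print(torch.allclose(T1.grad, 18 * T1.detach() + 6))       # True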

HOWEVER, the intriguing text inside the link you are referring to (url) aims for something different. Let me paste the code that originated the confusion,

weights = torch.randn(784, 10) / math.sqrt(784)
weights.requires_grad_()

For this case, it does matter where requires_grad is set to True. If you try to set it on the first line, like this

weights = torch.randn(784, 10, requires_grad=True) / math.sqrt(784)

after arbitrary operations are performed on weights and the backward() method is called, you will see a warning from PyTorch saying that you are trying to access the grad attribute of a non-leaf tensor, and weights.grad will be None. Why? Because in that case weights does not satisfy the definition of a leaf tensor: a leaf Variable is a variable that was not created by any operation tracked by the autograd engine (see this post for further examples). So, what is keeping weights from being a leaf variable? The division by sqrt(784).
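Here is a minimal sketch of that difference (assuming torch and math are imported); the is_leaf attribute makes it explicit:

# leaf: requires_grad_() is set after creation, so no tracked op created the tensor
w_leaf = torch.randn(784, 10) / math.sqrt(784)
w_leaf.requires_grad_()
print(w_leaf.is_leaf)     # True

# non-leaf: the division is an operation tracked by autograd
w_nonleaf = torch.randn(784, 10, requires_grad=True) / math.sqrt(784)
print(w_nonleaf.is_leaf)  # False

w_nonleaf.sum().backward()
print(w_nonleaf.grad)     # None, plus a UserWarning about accessing .grad of a non-leaf tensor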

Try it yourself and let me know!
