Initialization in-place and `tensor.data`

Like others, I have found that manually initializing weights in-place works when done through weight.data but not through weight itself. I would like to understand this more deeply. The following code shows three possible (not necessarily equivalent) approaches and whether they work.

import torch
import torch.nn as nn
import torch.nn.init as init

def do_init(layer):
    # works
    init.xavier_uniform_(layer.weight)
    
    # does not work: raises "a leaf Variable that requires grad is being used in an in-place operation"
    #layer.weight.random_()
    
    # works
    #layer.weight.data.random_()
    
lin = nn.Linear(4, 5)
print(lin.weight)
do_init(lin)
print(lin.weight)
  1. Modifying weight directly raises an error about an in-place edit of a leaf variable. But how is accessing weight.data any different? How is it treated under the hood?
  2. Is an in-place edit through .data actually safe, or does it confuse the computation graph? It feels like we are tricking PyTorch into letting us initialize the weights manually, and I am not comfortable tricking PyTorch without understanding the mechanism.
  3. What do the nn.init in-place methods do to avoid the “leaf variable” error? Do they also edit the data?

I tried looking at the source code but could not work it out for sure.

Thank you!

Hi,

.data should never be used. It does dangerous things under the hood and has many unwanted side effects.
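For example, an in-place edit through .data is invisible to autograd, so a value that was saved for the backward pass can be overwritten without any error, and the resulting gradient is silently wrong. A minimal sketch with a toy tensor (not the Linear layer from above):

import torch

x = torch.tensor([2.0], requires_grad=True)
y = (x * x).sum()    # autograd saves x in order to compute dy/dx = 2*x

x.data.fill_(10.0)   # not recorded by autograd, no error is raised
y.backward()
print(x.grad)        # tensor([20.]) -- silently wrong; the true gradient is tensor([4.])

# The same edit on x itself (x.fill_(10.0)) would instead raise:
# RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.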

To perform operations that are not recorded by autograd, you should use with torch.no_grad():.
This is exactly what the nn.init module does, as you can see here for example.
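Roughly, an initializer like xavier_uniform_ boils down to something like the following simplified sketch (the real implementation also handles conv kernels, gain calculation, and other details):

import math
import torch

def xavier_uniform_sketch(tensor, gain=1.0):
    # assumes a 2-D weight of shape (out_features, in_features)
    fan_out, fan_in = tensor.size(0), tensor.size(1)
    std = gain * math.sqrt(2.0 / float(fan_in + fan_out))
    a = math.sqrt(3.0) * std          # bound of the uniform so its std matches
    with torch.no_grad():             # the in-place edit is not recorded by autograd
        return tensor.uniform_(-a, a)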


So it seems a safe way to implement my own custom initialization is to use torch.no_grad() as you suggested:

torch.manual_seed(1)
def do_init2(layer):
    with torch.no_grad():
        layer.weight.normal_(0, 10)  # std=10 so the change is easy to spot
        layer.bias.fill_(0)

lin = nn.Linear(4, 5)
print("before")
print(lin.weight)
print(lin.bias)

do_init2(lin)

print("after")
print(lin.weight)
print(lin.bias)

output:

before
Parameter containing:
tensor([[ 0.2576, -0.2207, -0.0969,  0.2347],
        [-0.4707,  0.2999, -0.1029,  0.2544],
        [ 0.0695, -0.0612,  0.1387,  0.0247],
        [ 0.1826, -0.1949, -0.0365, -0.0450],
        [ 0.0725, -0.0020,  0.4371,  0.1556]], requires_grad=True)
Parameter containing:
tensor([-0.1862, -0.3020, -0.0838, -0.2157, -0.1602], requires_grad=True)
after
Parameter containing:
tensor([[ 11.7120,  17.6743,  -0.9536,   1.3937],
        [-12.1501,   7.3117,  11.7180,  -9.2739],
        [  5.4514,   0.6628,  -4.3704,   7.6260],
        [ 11.6327,  -0.0907,  -8.4246,   1.3741],
        [  9.3864,  -1.8600,  -6.4464,  15.3925]], requires_grad=True)
Parameter containing:
tensor([0., 0., 0., 0., 0.], requires_grad=True)
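As a quick sanity check (reusing the lin from above), the parameters still take part in autograd after the no_grad() initialization, so gradients flow as usual:

out = lin(torch.randn(3, 4)).sum()
out.backward()
print(lin.weight.grad.shape)   # torch.Size([5, 4])
print(lin.bias.grad)           # tensor([3., 3., 3., 3., 3.]) for a sum over 3 samples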

Thank you for helping me understand these mechanics a bit better!
