Weight initialization with nn.Module.apply() in nn.Sequential()

Hi,

I used torch.nn.Module.apply() to initialize the weights and biases of my nn.Sequential() model.
The code is shown below:

def random_weight(shape):
    if len(shape) == 2:  # FC weight
        fan_in = shape[0]
    else:
        fan_in = np.prod(shape[1:]) # conv weight [out_channel, in_channel, kH, kW]
    w = torch.randn(shape, device=device, dtype=dtype) * np.sqrt(2. / fan_in)
    w.requires_grad = True
    return w

def init_weights(m):
    if type(m) == nn.Linear or type(m) == nn.Conv2d:
        m.weight.data = random_weight(m.weight.data.size())

model.apply(init_weights)

The problem is that the params are not updated during training if I assign the initialization like this: m.weight.data = random_weight(m.weight.data.size()). My guess is that the weights get re-assigned every iteration, but I’m not sure.
So I’ve found a solution:

def init_weights(m):
    if type(m) == nn.Linear or type(m) == nn.Conv2d:
        m.weight.data.copy_ = random_weight(m.weight.data.size())

model.apply(init_weights)

Simply add copy_ and the params start to get updated.
Could anyone help explain this? Thank you.

  1. The copy_ function should be called, not assigned (see the short example after this list for why the assignment still appears to work):

m.weight.data.copy_(random_weight(m.weight.data.size()))

  2. The weight shape of nn.Linear in PyTorch is (out_features, in_features)!
    So in random_weight, shape[0] is out_features, and your fan_in is actually out_features rather than in_features.
    The reason the params didn’t update is that the initial weights were scaled by sqrt(2 / out_features) instead of sqrt(2 / in_features), which made the gradients vanish. A corrected sketch is given after this list.
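A note on why the copy_ assignment in your second snippet appears to "fix" things: m.weight.data.copy_ = ... never calls the in-place copy at all. It only binds a new Python attribute named copy_ on the tensor, so the weights are left at PyTorch's default initialization and training proceeds normally. A minimal illustration with toy tensors (not your model):

import torch

w = torch.zeros(3)
w.copy_ = torch.ones(3)   # only binds a new attribute named "copy_"; the values are untouched
print(w)                  # tensor([0., 0., 0.])

v = torch.zeros(3)
v.copy_(torch.ones(3))    # the in-place method call actually overwrites the values
print(v)                  # tensor([1., 1., 1.])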
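Putting the two points together, a corrected version of the snippet from the question could look like this (a minimal sketch; it assumes model, device, and dtype are defined as in the original post):

import numpy as np
import torch
import torch.nn as nn

def random_weight(shape):
    # nn.Linear stores its weight as (out_features, in_features),
    # so fan_in is shape[1]; for nn.Conv2d it is in_channels * kH * kW.
    if len(shape) == 2:  # FC weight
        fan_in = shape[1]
    else:
        fan_in = np.prod(shape[1:])  # conv weight [out_channel, in_channel, kH, kW]
    return torch.randn(shape, device=device, dtype=dtype) * np.sqrt(2. / fan_in)

def init_weights(m):
    if isinstance(m, (nn.Linear, nn.Conv2d)):
        with torch.no_grad():  # copy_ is in-place, so keep it out of autograd
            m.weight.copy_(random_weight(m.weight.size()))

model.apply(init_weights)

Since this is just He (Kaiming) initialization, calling nn.init.kaiming_normal_(m.weight, nonlinearity='relu') inside init_weights should give the same scaling without the custom helper.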