Hi!
I am new to PyTorch, and my model contains a bilinear layer (two inputs plus one bias).
I therefore implemented a simple module:
class Bilinear(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(Bilinear, self).__init__()
        self.W_a = nn.Parameter(torch.Tensor(input_size, input_size))
        self.W_b = nn.Parameter(torch.Tensor(hidden_size, input_size))
        self.b = nn.Parameter(torch.Tensor(input_size))

    def forward(self, x, h):
        return self.W_a.t().matmul(x) + self.b + self.W_b.t().matmul(h)
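For context, here is a self-contained sketch of the same module with an explicit initialization step added. Note that `nn.Parameter(torch.Tensor(n, m))` only allocates memory without setting the values, so the parameters start out as whatever was in that memory; the uniform init range below is an assumption for illustration, not necessarily the scheme `nn.Linear` uses internally.

```python
import torch
import torch.nn as nn


class BilinearManual(nn.Module):
    """Hand-rolled bilinear layer with an explicit parameter init.

    Without reset_parameters(), the weights hold uninitialized
    memory, which can contain arbitrarily large values.
    """

    def __init__(self, input_size, hidden_size):
        super(BilinearManual, self).__init__()
        self.W_a = nn.Parameter(torch.Tensor(input_size, input_size))
        self.W_b = nn.Parameter(torch.Tensor(hidden_size, input_size))
        self.b = nn.Parameter(torch.Tensor(input_size))
        self.reset_parameters()

    def reset_parameters(self):
        # Assumed init scheme: uniform in [-1/sqrt(n), 1/sqrt(n)].
        bound = 1.0 / self.W_a.size(0) ** 0.5
        for p in self.parameters():
            nn.init.uniform_(p, -bound, bound)

    def forward(self, x, h):
        return self.W_a.t().matmul(x) + self.b + self.W_b.t().matmul(h)


layer = BilinearManual(input_size=4, hidden_size=3)
out = layer(torch.randn(4), torch.randn(3))  # finite, shape (4,)
```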
and another module using the built-in Linear module, which should be mathematically the same:
class Bilinear(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(Bilinear, self).__init__()
        self.linear_i = nn.Linear(input_size, input_size)
        self.linear_h = nn.Linear(hidden_size, input_size, bias=False)

    def forward(self, x, h):
        return self.linear_i(x) + self.linear_h(h)
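For completeness, a quick check (a sketch, assuming `x` has `input_size` features and `h` has `hidden_size` features) that the `nn.Linear` version computes the same function as the manual formula once the same weights are used. `nn.Linear` stores its weight as `(out_features, in_features)`, so `linear(x)` is `x @ weight.t() + bias`:

```python
import torch
import torch.nn as nn

input_size, hidden_size = 4, 3
linear_i = nn.Linear(input_size, input_size)
linear_h = nn.Linear(hidden_size, input_size, bias=False)

x = torch.randn(input_size)
h = torch.randn(hidden_size)

# Re-express the manual bilinear formula using the weights that
# nn.Linear already initialized for us.
manual = (linear_i.weight.matmul(x) + linear_i.bias
          + linear_h.weight.matmul(h))
builtin = linear_i(x) + linear_h(h)

# Identical up to floating-point rounding.
same = torch.allclose(manual, builtin, atol=1e-6)
```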
Both run on the CPU.
Can someone explain why the outcomes of these two modules differ so much? (Sorry, the complete model would be too complex to explain here.)
The second one is much more stable. The first one sometimes leads to a loss of nan (with NLLLoss).
I am not asking for help with my concrete problem, but for help understanding what is going on under the hood of PyTorch that causes such large discrepancies between built-in and manually rebuilt modules, and how to avoid common pitfalls when writing low-level modules in PyTorch.
Thank you very much!