Hi everyone,
In my case I need to apply many tensors (all of the same size) to many Linear modules, one tensor per module.
These computations are independent of each other and their order doesn't matter.
To use the GPU as efficiently as possible, I wanted to perform all of them with as few separate GPU calls as possible.
So I designed a channel-wise Linear module based on PyTorch's Linear module:
import math
import torch
import torch.nn as nn

# Applies an independent Linear transformation to each channel:
# input (batch, channels, in_features) -> output (batch, channels, out_features)
class multiChannelsLinear(nn.Module):
    __constants__ = ['bias']

    def __init__(self, channels, in_features, out_features, bias=True):
        super(multiChannelsLinear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.channels = channels
        # one (out_features, in_features) weight matrix per channel
        self.weight = nn.Parameter(torch.Tensor(channels, out_features, in_features))
        if bias:
            self.bias = nn.Parameter(torch.Tensor(channels, out_features))
        else:
            self.register_parameter('bias', None)
        self.reset_parameters()

    def reset_parameters(self):
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = nn.init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in)
            nn.init.uniform_(self.bias, -bound, bound)

    def forward(self, input):
        # (batch, channels, in) -> (channels, in, batch)
        input = input.transpose(0, 2).transpose(0, 1)
        # batched matmul: (channels, out, in) @ (channels, in, batch) -> (channels, out, batch)
        output = self.weight.matmul(input)
        # back to (batch, channels, out)
        output = output.transpose(0, 1).transpose(0, 2)
        if self.bias is not None:
            output += self.bias
        return output

    def extra_repr(self):
        return 'channels={}, in_features={}, out_features={}, bias={}'.format(
            self.channels, self.in_features, self.out_features, self.bias is not None
        )
However, this code doesn't seem to produce the same result as using many separate Linear modules, and I can't find where I'm wrong.
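Here is the kind of sanity check I have in mind (a minimal sketch: the sizes and names are only for this example, and it assumes an input of shape (batch, channels, in_features)). It copies the parameters of a list of ordinary Linear modules into the channel-wise module and compares the outputs:

import torch
import torch.nn as nn

batch, channels, in_features, out_features = 4, 3, 5, 7

# one ordinary Linear per channel as the reference
linears = nn.ModuleList([nn.Linear(in_features, out_features) for _ in range(channels)])
mcl = multiChannelsLinear(channels, in_features, out_features)

# copy each Linear's parameters into the corresponding channel
with torch.no_grad():
    for c, lin in enumerate(linears):
        mcl.weight[c].copy_(lin.weight)
        mcl.bias[c].copy_(lin.bias)

x = torch.randn(batch, channels, in_features)
ref = torch.stack([lin(x[:, c]) for c, lin in enumerate(linears)], dim=1)
out = mcl(x)
print(torch.allclose(ref, out, atol=1e-6))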
By the way, why doesn't this kind of module exist in PyTorch?
I think it would make it easier to get good performance when we want to apply the same kind of transformation to each channel independently.
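For what it's worth, as far as I know the same per-channel operation can also be expressed with torch.einsum instead of a custom matmul/transpose dance (a sketch under the same shape assumptions as above; channelwise_linear is just a name I made up for the example):

import torch

def channelwise_linear(x, weight, bias=None):
    # x: (batch, channels, in_features)
    # weight: (channels, out_features, in_features), bias: (channels, out_features)
    out = torch.einsum('bci,coi->bco', x, weight)
    if bias is not None:
        out = out + bias
    return out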