How to run a calculation only once, at the beginning, inside the forward call of a custom network?

import torch
from torch import nn

class Self_Attn(nn.Module):
    """ Self attention Layer"""
    def __init__(self, in_dim):
        super().__init__()
        
        # Construct the conv layers
        self.query_conv = nn.Conv2d(in_channels = in_dim , out_channels = in_dim//2 , kernel_size= 1)
        self.key_conv = nn.Conv2d(in_channels = in_dim , out_channels = in_dim//2 , kernel_size= 1)
        self.value_conv = nn.Conv2d(in_channels = in_dim , out_channels = in_dim , kernel_size= 1)
        
        # Initialize gamma as 0
        self.gamma = nn.Parameter(torch.zeros(1))
        self.softmax  = nn.Softmax(dim=-1)
        
    def forward(self,x):
        """
            inputs :
                x : input feature maps( B * C * W * H)
            returns :
                out : self attention value + input feature 
                attention: B * N * N (N is Width*Height)
        """
        m_batchsize,C,width ,height = x.size()
        
        proj_query  = self.query_conv(x).view(m_batchsize, -1, width*height).permute(0,2,1) # B * N * C
        proj_key =  self.key_conv(x).view(m_batchsize, -1, width*height) # B * C * N
        energy =  torch.bmm(proj_query, proj_key) # batch matrix-matrix product
        
        attention = self.softmax(energy) # B * N * N
        proj_value = self.value_conv(x).view(m_batchsize, -1, width*height) # B * C * N
        out = torch.bmm(proj_value, attention.permute(0,2,1)) # batch matrix-matrix product
        out = out.view(m_batchsize,C,width,height) # B * C * W * H
        
        # Add attention weights onto input
        out = self.gamma*out + x
        return out, attention

This code defines a self-attention block. The line

m_batchsize,C,width ,height = x.size()

inside the forward call just extracts the size information of the batch. We only need to compute this once, at the beginning of training, and can reuse the same values afterwards. Is there any way to specify that a particular code block inside the forward call should run only the first time?

You could define a specific class method for it and store the shape information as attributes if you are sure they will never change, and use these attributes in the forward method. E.g. something like this should work:

class Self_Attn(nn.Module):
    """ Self attention Layer"""
    def __init__(self, in_dim):
        super().__init__()
        
        # Construct the conv layers
        self.query_conv = nn.Conv2d(in_channels = in_dim , out_channels = in_dim//2 , kernel_size= 1)
        self.key_conv = nn.Conv2d(in_channels = in_dim , out_channels = in_dim//2 , kernel_size= 1)
        self.value_conv = nn.Conv2d(in_channels = in_dim , out_channels = in_dim , kernel_size= 1)
        
        # Initialize gamma as 0
        self.gamma = nn.Parameter(torch.zeros(1))
        self.softmax  = nn.Softmax(dim=-1)

    def calc_shapes(self, x):
        m_batchsize, C, width, height = x.size()
        self.batchsize = m_batchsize
        self.C = C
        self.width = width
        self.height = height

    def forward(self,x):        
        proj_query  = self.query_conv(x).view(self.batchsize, -1, self.width*self.height).permute(0,2,1) # B * N * C
        proj_key =  self.key_conv(x).view(self.batchsize, -1, self.width*self.height) # B * C * N
        energy =  torch.bmm(proj_query, proj_key) # batch matrix-matrix product
        
        attention = self.softmax(energy) # B * N * N
        proj_value = self.value_conv(x).view(self.batchsize, -1, self.width*self.height) # B * C * N
        out = torch.bmm(proj_value, attention.permute(0,2,1)) # batch matrix-matrix product
        out = out.view(self.batchsize, self.C, self.width, self.height) # B * C * W * H
        
        # Add attention weights onto input
        out = self.gamma*out + x
        return out, attention

data = torch.rand(8, 64, 16, 16)  # example input: B * C * W * H
model = Self_Attn(in_dim=64)
model.calc_shapes(data)
output, attention = model(data)

Alternatively, you could of course also calculate this shape information in the __init__ method, if you pass an input to it.
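
A minimal sketch of that alternative, assuming an example input is available when the module is constructed. The class name `ShapeFromInit` and the argument `example_input` are hypothetical and only serve as illustration:

```python
import torch
from torch import nn


class ShapeFromInit(nn.Module):
    """Hypothetical module that reads the shape from an example input in __init__."""

    def __init__(self, example_input):
        super().__init__()
        # Store the shape once; later inputs are assumed to have the same shape.
        self.batchsize, self.C, self.width, self.height = example_input.size()

    def forward(self, x):
        # Flatten the spatial dimensions using the stored shape: B * C * N
        return x.view(self.batchsize, self.C, self.width * self.height)


example = torch.rand(2, 4, 3, 3)
module = ShapeFromInit(example)
flat = module(example)
```

As with `calc_shapes`, this only holds as long as every later input has exactly the shape of the example input, including the batch size.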


Thank you so much for your reply. This method works when we know about the input already. If I want to use this self-attention block inside of any other bigger network at multiple locations, I need to query different class methods every time to estimate the input shape. For each query, I need to build the network up to that point.

I am looking for something that works like the forward method: it should automatically take the input x from the first training batch only and update self.width and self.height.

In that case a condition using an attribute should work:

def forward(self,x):
    if not self.initialized:
        self.calc_shapes(x)
        self.initialized = True
    ...
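
A fuller sketch of this pattern, assuming the flag is created in `__init__`. The class name `ShapeCachingBlock` and the attributes `cached_shape` and `calc_calls` are hypothetical; `calc_calls` exists only to show that the shape calculation runs a single time:

```python
import torch
from torch import nn


class ShapeCachingBlock(nn.Module):
    """Hypothetical module that records the input shape on the first forward call only."""

    def __init__(self):
        super().__init__()
        self.initialized = False  # flipped to True after the first forward call
        self.cached_shape = None
        self.calc_calls = 0       # counts how often calc_shapes actually runs

    def calc_shapes(self, x):
        self.cached_shape = tuple(x.shape)
        self.calc_calls += 1

    def forward(self, x):
        if not self.initialized:
            self.calc_shapes(x)
            self.initialized = True
        b, c, w, h = self.cached_shape
        # Flatten the spatial dimensions using the cached shape: B * C * N
        return x.view(b, c, w * h)


block = ShapeCachingBlock()
x = torch.rand(4, 8, 5, 5)
out1 = block(x)  # first call: shapes are computed and cached
out2 = block(x)  # second call: cached shapes are reused
```

Note that the cached batch size goes stale if a later batch has a different size, for example a smaller last batch in an epoch, so this is only safe when the input shape really never changes.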

It is working.

Furthermore, I got this quirky idea of creating a convolution module where we don’t need to specify the input channels explicitly, and the actual layer is created inside the first forward pass.

class convolution(nn.Module):

    def __init__(self):
        super().__init__()
        self.initialized = True

    def forward(self, x):
        if self.initialized:
            in_channel = x.shape[1]
            self.conv = nn.Conv2d(in_channel, 32, 3)
            self.initialized = False

        x = self.conv(x)

        return x

Will it work as intended?

I tested the above concept of creating a trainable module inside the first forward pass only, using a custom network. Please give feedback on its correctness.

from torch import nn
import torch


class convolution(nn.Module):
    def __init__(self):
        super().__init__()
        self.initialized = True

    def forward(self, x):
        if self.initialized:
            in_channel = x.shape[1]
            self.conv = nn.Conv2d(in_channel, 32, 3)
            self.initialized = False

        x = self.conv(x)

        return x


class network(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = convolution()
        self.conv2 = convolution()
        self.conv3 = convolution()

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        return x


net = network()

print("*" * 20, "Before warm up", "*" * 20)
print("network:\n", net)

# Not going to print anything
for name, param in net.named_parameters():
    if param.requires_grad:
        print(name)

# warm up
rand_input_1 = torch.rand((16, 64, 7, 7))
a1 = net(rand_input_1)

print("*" * 20, "After warm up", "*" * 20)
print(net)
for name, param in net.named_parameters():
    if param.requires_grad:
        print(name, param.data.sum())

# second call with the same channel size (an input with a different channel size would throw an error)
rand_input_2 = torch.rand((16, 64, 7, 7))
a2 = net(rand_input_2)

print("\nParameters is not initiated again in second call. It remains with same weight\n")
for name, param in net.named_parameters():
    if param.requires_grad:
        print(name, param.data.sum())

The output of the above snippet

******************** Before warm up ********************
network:
 network(
  (conv1): convolution()
  (conv2): convolution()
  (conv3): convolution()
)
******************** After warm up ********************
network(
  (conv1): convolution(
    (conv): Conv2d(64, 32, kernel_size=(3, 3), stride=(1, 1))
  )
  (conv2): convolution(
    (conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))
  )
  (conv3): convolution(
    (conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))
  )
)
conv1.conv.weight tensor(1.9705)
conv1.conv.bias tensor(0.1167)
conv2.conv.weight tensor(-6.0852)
conv2.conv.bias tensor(0.0791)
conv3.conv.weight tensor(-2.9391)
conv3.conv.bias tensor(0.1060)
Parameters is not initiated again in second call. It remains with same weight
conv1.conv.weight tensor(1.9705)
conv1.conv.bias tensor(0.1167)
conv2.conv.weight tensor(-6.0852)
conv2.conv.bias tensor(0.0791)
conv3.conv.weight tensor(-2.9391)
conv3.conv.bias tensor(0.1060)

This network configures itself automatically on the first forward pass.

Your approach could work theoretically, but I would be careful about the overall workflow in your training script.
Usually you would set up the model and pass all parameters to an optimizer before the first forward pass.
This won’t work properly in your approach, since not all modules are initialized yet, so you would need to perform an example forward pass before creating the optimizer.
While it might work for your project, other users of your model might easily run into issues where the model doesn’t train because the parameters were lazily initialized.
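
The ordering issue can be sketched as follows. `LazyConvBlock` is a hypothetical stand-in for the `convolution` module above, and the input shape and learning rate are arbitrary example values:

```python
import torch
from torch import nn


class LazyConvBlock(nn.Module):
    """Hypothetical stand-in for `convolution`: builds its conv layer on first use."""

    def __init__(self):
        super().__init__()
        self.conv = None  # created in the first forward pass

    def forward(self, x):
        if self.conv is None:
            # Infer the number of input channels from the first input.
            self.conv = nn.Conv2d(x.shape[1], 32, 3)
        return self.conv(x)


net = LazyConvBlock()
params_before = len(list(net.parameters()))  # no parameters exist yet

# Warm-up forward pass so the conv layer and its parameters get created.
with torch.no_grad():
    net(torch.rand(1, 64, 7, 7))

params_after = len(list(net.parameters()))  # weight and bias of the conv

# Only now does the optimizer see the parameters.
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
```

Creating the optimizer before the warm-up pass would hand it an empty parameter list here. In a larger model that mixes regular and lazily created layers, the lazily created parameters would be missing from the optimizer and never updated. PyTorch also provides `nn.LazyConv2d`, which infers `in_channels` on the first forward pass and may be a better fit than a hand-written version.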


Thank you for your feedback.
I agree with your point.