Loss not backpropagating while implementing a Hypernetwork in PyTorch

I am trying to use a Hypernetwork to fine-tune a pretrained model on a new set of speakers or data points.
The pretrained model consists of Conv2d layers and a TCN. Using the Hypernetwork, I am trying to predict the weights for the Conv2d layers. The pretrained model is frozen and only the weights predicted by the Hypernetwork are being updated. The problem is that the loss is not backpropagating through the parameters of the Primary Network. The Primary Network is the network whose parameters are the random embeddings supplied to the Hypernetwork, as well as the weights, biases, and other parameters of the two-layer Hypernetwork.
I suspect the loss is not backpropagating because of the assign_weights function we are using to modify the weights of the original model. Since our model is built with nn.Conv2d, I see no other way to supply the predicted weights to the parameters of the pretrained model.

The code for the primary network and embeddings:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.parameter import Parameter
from hypernet.hypernetwork_modules import HyperNetwork
from lipreading.utils import load_model, CheckpointSaver
from kd.model import lrwmodel

    
class Embedding(nn.Module):

    def __init__(self, z_num, z_dim):
        super(Embedding, self).__init__()

        self.z_list = nn.ParameterList()
        self.z_num = z_num # kernel size
        self.z_dim = z_dim # dimension of hypernetwork layer... nothing to do with primary network

        h, k = self.z_num # kernel size is stored in h and k, where h is the height and k is the width (?)
        
        # Initialize an embedding for each index in a filter: a 1x1 filter gets a 1 x 64 embedding, a 4x4 filter gets a 16 x 64 embedding
        for i in range(h):
            for j in range(k):
                self.z_list.append(Parameter(torch.fmod(torch.randn(self.z_dim).cuda(), 2)))


    def forward(self, hyper_net):
        ww = []
        _in, _out = self.z_num
        c=0
        for i in range(_out):
            w = []
            for j in range(_in):
                h_out = hyper_net(self.z_list[c])
                c+=1
                w.append(h_out)
            temp_w = torch.cat(w, dim=1)
            ww.append(temp_w)
        temp_ww = torch.cat(ww, dim=0)
        
        return temp_ww


class PrimaryNetwork(nn.Module):

    def __init__(self, z_dim=64, residual_hypernet=False, load_checkpoint_path=None, alpha=0.3):
        super(PrimaryNetwork, self).__init__()


        device='cuda'
        self.alpha=alpha
        self.residual_hypernet=residual_hypernet
        self.z_dim = z_dim
        self.hope = HyperNetwork(z_dim=self.z_dim)
        self.hope=self.hope.to(device)

        #represents the in and out channels of the network. Multiples of (64,64) since thats the model size.

        self.zs_size = [[1, 1],[1,2],[2,2],[2,4],[4,4],[4,8]]

        self.filter_size = [[64,64],[64,128],[128,128],[128,256],[256,256],[256,512]]


        # List that contains the embeddings to be given to the hypernetwork. We will preferably have to save this list somewhere.
        self.zs = nn.ModuleList()

        for i in range(len(self.zs_size)):
            self.zs.append(Embedding(self.zs_size[i], self.z_dim))

        self.model=lrwmodel()
        self.model=self.model.to(device)

        if(load_checkpoint_path):
            self.model = load_model(load_checkpoint_path, self.model, allow_size_mismatch=False)
            print("Loaded checkpoint and model from -- ",load_checkpoint_path)


        #self.final = nn.Linear(500,500)
        #print("Residual Hypernet is ",residual_hypernet)
        #self.model.eval()
        # for param in self.model.parameters():
        #     param.requires_grad = False
        self.weights_list = []

    def assign_weights(self, weights_list):
        #print("length of weight ----",len(weights_list))
        layer_name = 'mini_resnet'
        
        i=0
        for name, param in self.model.named_parameters():
            
            if layer_name in name and 'weight' in name and len(list(param.shape))==4 and list(param.shape)[1] != 1 :
                
                if self.residual_hypernet == True:
                    param.data= self.alpha*weights_list[i]+ param.data
                else:
                    param.data=weights_list[i]
                i+=1
                

    ## PLAN FOR FORWARD -

    ## Get the embeddings. Send the embeddings to the Hypernetwork. Let it return the weights. Collect the weights in a list and pass them all together in the forward function.

    def forward(self, x, lengths):
        self.weights_list=[]
        for i in range(6):
            w1 = self.zs[i](self.hope)
            self.weights_list.append(w1)

        self.assign_weights(self.weights_list)
        with torch.no_grad():
            _,x=self.model(x,lengths)
            
        return x


There might be a problem with the way we are assigning the weights at each forward call, due to which the loss is not backpropagating through the Primary Network parameters.

I would like to understand whether there are workarounds, such as copying gradients or a smarter way of updating the weights, that will keep the gradient flow intact.

Most people using Hypernetworks do so while training from scratch, but we want to apply one to a pretrained model. The simplest way would be to pass the predicted weights through torch.nn.functional.conv2d, but that would force us to change the entire architecture and make it difficult to load the pretrained model.
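
For reference, a minimal sketch of that functional route (toy shapes, not the actual lrwmodel); the predicted weight stays in the graph because F.conv2d receives it as an explicit argument:

import torch
import torch.nn.functional as F

# Toy shapes for illustration; stand-in for a weight predicted by the Hypernetwork.
predicted_weight = torch.randn(64, 64, 3, 3, requires_grad=True)
x = torch.randn(1, 64, 32, 32)

# The weight is passed explicitly, so it stays attached to the computation graph.
out = F.conv2d(x, predicted_weight, bias=None, stride=1, padding=1)
out.sum().backward()
print(predicted_weight.grad is not None)  # True
# The downside: every nn.Conv2d call in the pretrained architecture would have to be rewritten this way.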

Looking forward to your help.
Thanks.

You are executing the forward pass in a no_grad context, which disables gradient computation. I would thus assume you would receive an error when trying to call backward on the output or loss.
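
A minimal snippet, independent of the model above, illustrating the effect of no_grad:

import torch

w = torch.randn(3, requires_grad=True)
with torch.no_grad():
    y = (w * 2).sum()

print(y.requires_grad)  # False: no graph is recorded inside no_grad
# y.backward()  # would raise a RuntimeError, since y has no grad_fn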

That is the pretrained model that I am calling inside the network. There is no error, but when I print the gradients of the Primary Network parameters during training they appear as None. Basically, I feel the assign_weights function is not differentiable. Is there any way out?

This method isn’t using any input and is thus not even attached to the computation graph potentially created by x. As described before, x is then also used in a no_grad context, so again no computation graph is created. I unfortunately don’t understand how it could work and what the use case is.
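
One possible way out, not discussed in the thread but added here as a sketch: recent PyTorch versions provide a stateless functional call (torch.func.functional_call in 2.x, torch.nn.utils.stateless.functional_call in 1.12+) that runs a frozen module with externally supplied tensors substituted for its parameters. That avoids both the param.data assignment and the no_grad block while keeping the nn.Conv2d architecture and checkpoint loading unchanged:

import torch
import torch.nn as nn
from torch.func import functional_call  # assumes torch >= 2.0

# Toy stand-in for one frozen Conv2d of the pretrained model (shapes are illustrative only).
frozen_conv = nn.Conv2d(64, 64, kernel_size=1, bias=False)
for p in frozen_conv.parameters():
    p.requires_grad_(False)

# Stand-in for a weight predicted by the Hypernetwork (here just a leaf tensor that requires grad).
predicted_weight = torch.randn(64, 64, 1, 1, requires_grad=True)

x = torch.randn(2, 64, 16, 16)
# Run the frozen module with the predicted weight substituted in; no .data assignment, no no_grad.
out = functional_call(frozen_conv, {"weight": predicted_weight}, (x,))
out.sum().backward()
print(predicted_weight.grad is not None)  # True: gradients flow back toward the Hypernetwork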