Train a model to output weights of another model, and use the other model just as function evaluation

I have 2 models, A and B.
A(x1)=Weights of B
B(x2)=Final output

A is trainable
B is not trainable (I just want to upload the outputs of A into B and infer)

The problem I am facing: the output of A is a torch.Tensor, so to set the weights of B I have to slice A's output tensor. However, doing so breaks the gradient flow from the final loss back to the weights of A, so no training happens. How can I implement this idea, or correct my code?

My Source-Code:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.parameter import Parameter
from torch.autograd import Variable
import numpy as np

class Hyper_Model(nn.Module):

    def __init__(self):

        super(Hyper_Model, self).__init__()
        self.layers = nn.Sequential(nn.Linear(1,32),

    def forward(self,param):        
        param_ = self.layers(param)            
        return param_

class Main_Model(nn.Module):

    def __init__(self):

        super(Main_Model, self).__init__()
        self.linear1 = nn.Linear(2,8)
        self.linear2 = nn.Linear(8,8)
        self.linear3 = nn.Linear(8,8)
        self.out = nn.Linear(8,1)

    def forward(self,param_,x):
        self.linear1.weight = torch.nn.Parameter(param_[0,:16].view(8,2))
        self.linear2.weight = torch.nn.Parameter(param_[0,24:88].view(8,8))
        self.linear3.weight = torch.nn.Parameter(param_[0,96:160].view(8,8))
        self.linear1.bias = torch.nn.Parameter(param_[0,16:24].view(8))
        self.linear2.bias = torch.nn.Parameter(param_[0,88:96].view(8))
        self.linear3.bias = torch.nn.Parameter(param_[0,160:168].view(8))
        self.out.weight = torch.nn.Parameter(param_[0,168:176].view(1,8))
        self.out.bias = torch.nn.Parameter(param_[0,176:].view(1))

        self.linear1.weight.requires_grad = False
        self.linear2.weight.requires_grad = False
        self.linear3.weight.requires_grad = False        
        self.linear1.bias.requires_grad = False
        self.linear2.bias.requires_grad = False
        self.linear3.bias.requires_grad = False
        self.out.weight.requires_grad =  False
        self.out.bias.requires_grad =  False

        x = F.relu(self.linear1(x))
        x = F.relu(self.linear2(x))
        x = F.relu(self.linear3(x)) 
        x = self.out(x)
        return x

x = torch.tensor([1.0,2.0,3.0],requires_grad=True).view(3,1)
t = torch.tensor([1.0,1.5,2.0],requires_grad=True).view(3,1)
param = torch.tensor([-0.01]).view(1,1)
X = torch.cat([x,t],dim=1)
Y = torch.tensor([5.0,6.0,9.0]).view(3,1)
h = Hyper_Model()
m = Main_Model()
opt = torch.optim.Adam(list(h.parameters()), lr=0.001)
loss_func = nn.MSELoss()

for i in range(10):
    param_ = h(param)   

    out = m(param_,X)
    loss = loss_func(out,Y)

    opt.zero_grad()
    loss.backward()
    opt.step()
    print(i, loss)

0 tensor(46.0043, grad_fn=)
1 tensor(46.0043, grad_fn=)
2 tensor(46.0043, grad_fn=)
3 tensor(46.0043, grad_fn=)
4 tensor(46.0043, grad_fn=)
5 tensor(46.0043, grad_fn=)
6 tensor(46.0043, grad_fn=)
7 tensor(46.0043, grad_fn=)
8 tensor(46.0043, grad_fn=)
9 tensor(46.0043, grad_fn=)

This may be helpful. I recently came across a repo that does something similar (predicting the weights of another network) here. This is how they do it for an FC layer, inheriting from MetaModule in the torchmeta package.

Hi Ritam!

As I understand it, you do not want to directly train the weights of B, but
you do want to train the weights of A so that it produces values that work
better when used as the weights of B.

(As an aside, don’t use torch.autograd.Variable. It is deprecated.
Just use regular pytorch tensors.)

Don’t create instances of Linear in Main_Model (as I understand it,
“Model B”). (So you won’t be setting the weights of instances of Linear
to the weight-values produced by “Model A” and you won’t be applying
those instances of Linear to x.)
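
To see why assigning the slices through torch.nn.Parameter stops training, here is a minimal check. It is only a sketch: the small Linear layer stands in for "Model A", and the names a, sliced, and wrapped are made up for illustration. Slicing by itself keeps the autograd graph, but wrapping the slice in a Parameter creates a fresh leaf tensor with no history, so nothing can backpropagate through it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for "Model A": maps one input to four values.
a = nn.Linear(1, 4)
out = a(torch.tensor([[-0.01]]))    # shape (1, 4), attached to the autograd graph

sliced = out[0, :2]                 # slicing alone keeps the graph intact
wrapped = nn.Parameter(out[0, :2])  # wrapping makes a new leaf, cut off from "Model A"

print(sliced.grad_fn is not None)   # the slice still has a backward function
print(wrapped.grad_fn is None)      # the Parameter has no history
print(wrapped.is_leaf)              # it is a leaf tensor
```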

Instead, apply the non-class, functional form of Linear to x passing in
the weight-values produced by “Model A” as arguments:

    def forward (self, param_, x):
        x = F.relu (F.linear (x, weight = param_[0, :16].view (8, 2), bias = param_[0, 16:24].view (8)))
        return x

F.linear() is now just an ordinary PyTorch tensor function. The gradients
produced by calling .backward() on the loss calculated from the output
of “Model B” will backpropagate properly through the calls to F.linear()
in “Model B”'s forward() function and then backpropagate through the
“Model A” that produced the weight-values, producing gradients for the
weights in “Model A”, which will then be optimized when you call opt.step().
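
Putting this together for all four layers, a fuller sketch might look like the following. The slice layout (weight, then bias, for each layer) and the data follow the code in the question. The hypernetwork's hidden ReLU and its output size of 177 are assumptions, since the original Sequential is cut off. The names SHAPES, N_PARAMS, HyperModel, and main_model are made up for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# (out_features, in_features) of each layer of "Model B", as in the question.
SHAPES = [(8, 2), (8, 8), (8, 8), (1, 8)]
# Total number of weight and bias values: 24 + 72 + 72 + 9 = 177.
N_PARAMS = sum(o * i + o for o, i in SHAPES)


class HyperModel(nn.Module):
    """'Model A'. Assumed architecture: only Linear(1, 32) is from the question."""

    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, N_PARAMS)
        )

    def forward(self, param):
        return self.layers(param)


def main_model(flat, x):
    """'Model B' as a pure function; flat is a 1-D tensor of length N_PARAMS."""
    offset = 0
    for idx, (o, i) in enumerate(SHAPES):
        w = flat[offset:offset + o * i].view(o, i)
        offset += o * i
        b = flat[offset:offset + o]
        offset += o
        x = F.linear(x, w, b)          # gradients flow back into flat
        if idx < len(SHAPES) - 1:      # no ReLU after the output layer
            x = F.relu(x)
    return x


X = torch.tensor([[1.0, 1.0], [2.0, 1.5], [3.0, 2.0]])
Y = torch.tensor([[5.0], [6.0], [9.0]])
param = torch.tensor([[-0.01]])

h = HyperModel()
opt = torch.optim.Adam(h.parameters(), lr=0.001)
loss_func = nn.MSELoss()

losses = []
for step in range(10):
    opt.zero_grad()
    flat = h(param)[0]                 # shape (N_PARAMS,)
    loss = loss_func(main_model(flat, X), Y)
    loss.backward()
    opt.step()
    losses.append(loss.item())

# Every parameter of "Model A" should now have received a gradient.
grads_ok = all(p.grad is not None for p in h.parameters())
print(grads_ok, losses[0], losses[-1])
```

Because "Model B" holds no parameters of its own here, there is nothing in it to freeze; only h.parameters() is passed to the optimizer.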


K. Frank

Thank you very much. I am working on something very similar to the idea in the Repo you just shared. It was very helpful!

@KFrank Thank you very much. This is precisely what I was looking for, and it is working very smoothly for my use case! Just a small follow-up: torch.nn modules are recommended for proper backprop training, and torch.nn.functional functions are suitable for function evaluations without gradients. Would that be a crude but suitable summary?