Learnable scalar weights in PyTorch with a guarantee that the scalars sum to 1

Hello, maybe I missed an already-posted discussion on this topic.
So, I have this code:

import torch
import torch.nn as nn


class MyModule(nn.Module):
    
    def __init__(self, channel, reduction=16, n_segment=8):
        super(MyModule, self).__init__()
        self.channel = channel
        self.reduction = reduction
        self.n_segment = n_segment
        
        self.conv1 = nn.Conv2d(in_channels=self.channel, out_channels=self.channel//self.reduction, kernel_size=1, bias=False)
        self.conv2 = nn.Conv2d(in_channels=self.channel, out_channels=self.channel//self.reduction, kernel_size=1, bias=False)
        self.conv3 = nn.Conv2d(in_channels=self.channel, out_channels=self.channel//self.reduction, kernel_size=1, bias=False)
        # ... (other layers omitted; self.avg_pool, used in forward below, is assumed to be defined here)

        # learnable weight
        self.W_1 = nn.Parameter(torch.randn(1), requires_grad=True)
        self.W_2 = nn.Parameter(torch.randn(1), requires_grad=True)
        self.W_3 = nn.Parameter(torch.randn(1), requires_grad=True)

    def forward(self, x):
        
        # whatever
        
        ## branch1                
        bottleneck_1 = self.conv1(x)
        
        ## branch2
        bottleneck_2 = self.conv2(x)
        
        ## branch3                
        bottleneck_3 = self.conv3(x)
        
        ## summation
        output = self.avg_pool(self.W_1*bottleneck_1 + 
                          self.W_2*bottleneck_2 + 
                          self.W_3*bottleneck_3) 
        
        return output

As you can see, 3 learnable scalars (W_1, W_2, and W_3) are used for weighting. However, this approach does not guarantee that the sum of those scalars is 1. How can I make my learnable scalars sum to 1? Thanks

I could think of the following way:

class MyModule(nn.Module):
    def __init__(self, channel, reduction=16, n_segment=8):
        ...
        self.W = nn.Parameter(torch.randn(3))
        ...

    def forward(self, x):
        # whatever
        ...

        w = self.W.softmax(-1)
        output = self.avg_pool(w[0]*bottleneck_1 + 
                               w[1]*bottleneck_2 + 
                               w[2]*bottleneck_3) 
        ...
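As a quick standalone check (a minimal sketch, separate from the model above), the softmax always produces non-negative weights that sum to 1:

import torch

# Minimal sketch: softmax over the raw parameter yields non-negative
# weights that sum to 1, which is exactly the normalization constraint
# asked about above.
W = torch.randn(3, requires_grad=True)
w = W.softmax(-1)
print(w)        # three weights in (0, 1), still differentiable w.r.t. W
print(w.sum())  # 1.0 (up to floating-point error)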


Hi @InnovArul, that's a genius approach, many thanks!

Hi again @InnovArul, when I print(w), I see repeating values of w. Why is w not updating? I declare W like this: self.W = nn.Parameter(torch.randn(3), requires_grad=True)

Maybe it is learning equal weightage? What do you think?
I am not sure of your exact scenario.

A small correction: when you use nn.Parameter, it automatically sets requires_grad=True, so you don't need to specify it manually.
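For reference, a tiny standalone check of that default:

import torch
import torch.nn as nn

# nn.Parameter enables gradients by default, so requires_grad=True is redundant.
p = nn.Parameter(torch.randn(3))
print(p.requires_grad)  # True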

@InnovArul, I want to implement something like this:

[image attachment]

What values are getting printed, by the way? Do they add up to 1? Maybe you have to train for a longer time?

Also, they seem to share the parameters between all 3 convs. Your code doesn't seem to do that. I don't know if that's a key point to consider.
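If sharing is what the figure intends, a minimal sketch (assuming the three branches should apply the same 1x1 convolution to different inputs; the module name here is hypothetical) could reuse a single conv:

import torch.nn as nn

class SharedBranchModule(nn.Module):
    # Hypothetical illustration: one conv whose weights are shared by all branches.
    def __init__(self, channel, reduction=16):
        super().__init__()
        self.conv = nn.Conv2d(channel, channel // reduction, kernel_size=1, bias=False)

    def forward(self, x1, x2, x3):
        # The same module (and therefore the same weights) processes each branch input.
        return self.conv(x1), self.conv(x2), self.conv(x3)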

What values are getting printed, by the way?

print(w) after self.avg_pool gives me this:
tensor([0.1944, 0.0687, 0.7370], device='cuda:0', grad_fn=)
tensor([0.1981, 0.1346, 0.6673], device='cuda:0', grad_fn=)
tensor([0.2181, 0.4030, 0.3789], device='cuda:0', grad_fn=)
tensor([0.6972, 0.2325, 0.0704], device='cuda:0', grad_fn=)
tensor([0.0573, 0.0222, 0.9205], device='cuda:0', grad_fn=)
tensor([0.0967, 0.6594, 0.2438], device='cuda:0', grad_fn=)
tensor([0.0540, 0.3833, 0.5628], device='cuda:0', grad_fn=)
tensor([0.1071, 0.8304, 0.0625], device='cuda:0', grad_fn=)
tensor([0.1369, 0.1455, 0.7176], device='cuda:0', grad_fn=)
tensor([0.1141, 0.6166, 0.2694], device='cuda:0', grad_fn=)
tensor([0.5121, 0.3077, 0.1802], device='cuda:0', grad_fn=)
tensor([0.6380, 0.2686, 0.0934], device='cuda:0', grad_fn=)
tensor([0.3382, 0.1036, 0.5582], device='cuda:0', grad_fn=)
tensor([0.1021, 0.3302, 0.5676], device='cuda:0', grad_fn=)
tensor([0.0386, 0.6035, 0.3578], device='cuda:0', grad_fn=)
tensor([0.1217, 0.5602, 0.3181], device='cuda:0', grad_fn=)
Epoch: [0][0/43008], lr: 0.00500 Time 3.649 (3.649) Data 0.115 (0.115) Loss 5.1573 (5.1573) Prec@1 0.000 (0.000) Prec@5 0.000 (0.000)
(the same 16 tensors then repeat, in the same order, for the following forward passes)

Do they add up to 1?

Yes

Maybe you have to train for a longer time?

I will try; I only ran it for about 3 seconds before.

Also, they seem to share the parameters between all 3 convs

That was my mistake when writing the code for this discussion; I actually have only 1 conv, not 3 convs as shown in the question.

@InnovArul, it still prints repeating values after running for 15 minutes.

Can you paste those values here?

tensor([0.5125, 0.2359, 0.2516], device='cuda:0', grad_fn=)
tensor([0.7157, 0.2214, 0.0629], device='cuda:0', grad_fn=)
tensor([0.2182, 0.4743, 0.3075], device='cuda:0', grad_fn=)
tensor([0.5767, 0.1281, 0.2952], device='cuda:0', grad_fn=)
tensor([0.0621, 0.7963, 0.1416], device='cuda:0', grad_fn=)
tensor([0.1327, 0.1156, 0.7517], device='cuda:0', grad_fn=)
tensor([0.2615, 0.5469, 0.1917], device='cuda:0', grad_fn=)
tensor([0.0467, 0.4114, 0.5419], device='cuda:0', grad_fn=)
tensor([0.6102, 0.0481, 0.3417], device='cuda:0', grad_fn=)
tensor([0.4151, 0.4901, 0.0948], device='cuda:0', grad_fn=)
tensor([0.1035, 0.1519, 0.7446], device='cuda:0', grad_fn=)
tensor([0.2012, 0.6926, 0.1062], device='cuda:0', grad_fn=)
tensor([0.6431, 0.0715, 0.2854], device='cuda:0', grad_fn=)
tensor([0.2183, 0.6838, 0.0978], device='cuda:0', grad_fn=)
tensor([0.1041, 0.5261, 0.3698], device='cuda:0', grad_fn=)
tensor([0.2220, 0.1611, 0.6169], device='cuda:0', grad_fn=)
Epoch: [0][0/86017], lr: 0.00500 Time 3.243 (3.243) Data 0.078 (0.078) Loss 5.1598 (5.1598) Prec@1 0.000 (0.000) Prec@5 0.000 (0.000)
(the same 16 tensors then repeat, in the same order, for the following forward passes)

There should be grad_fn=<SoftmaxBackward>, but when I paste the output here, the grad_fn part comes out empty.

Can you point out the repeating values?

@InnovArul, I also printed model.module.base_model.layer1[0].mve.W.data after optimizer.step(), and the tensor values don't change at all after several iterations. I really need your help.

I am not sure how exactly they can repeat when you train with shuffled training data.
Also, try not regularizing this self.W (for example, excluding it from weight decay) to see its effect.

At least this part of the code looks fine to me. I don't know about the other parts of your code.
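For the "do not regularize self.W" idea, one possible way (a sketch with a hypothetical stand-in model, not your actual training code) is to give W its own optimizer parameter group with weight_decay=0:

import torch
import torch.nn as nn

class TinyModel(nn.Module):
    # Minimal stand-in just to make the sketch runnable; your real model
    # would contain the softmax-normalized weights self.W instead.
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3)
        self.W = nn.Parameter(torch.randn(3))

model = TinyModel()

# Put W in its own parameter group with weight_decay=0 so it is not regularized.
decay, no_decay = [], []
for name, param in model.named_parameters():
    (no_decay if name.split('.')[-1] == 'W' else decay).append(param)

optimizer = torch.optim.SGD(
    [{'params': decay, 'weight_decay': 5e-4},
     {'params': no_decay, 'weight_decay': 0.0}],
    lr=0.005, momentum=0.9)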

By the way, when I print the modules using for m in self.modules(): ..., self.W is not listed. How strange.

UPDATE:
@InnovArul, it turns out I need to wrap my self.W in nn.ParameterList() for it to be recognized in model.modules().

UPDATE 2:
It's updating now, thanks so much!

Do you mean that using nn.ParameterList() works, but nn.Parameter() does not?
In my understanding, there is no difference between the two in this scenario.

Maybe if you mention how you solved it, it will be helpful to others.

@InnovArul, sorry for the late reply.
In my code, I have a function that assigns a different learning rate and weight decay to each layer, so that not all layers share the same LR and WD. For this, I call model.modules() to iterate over all modules/layers inside my model. Unfortunately, self.W cannot be seen there. Searching the Internet, I found that several people use nn.ParameterList or nn.ParameterDict so that an nn.Parameter declared inside MyModule is visible when using model.modules().
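For reference, a minimal sketch of that workaround (with a hypothetical stand-in module): wrapping the parameter in nn.ParameterList registers it as a submodule, so it shows up when iterating model.modules():

import torch
import torch.nn as nn

class MyModuleSketch(nn.Module):
    # Illustrative stand-in, not the full MyModule from above.
    def __init__(self):
        super().__init__()
        # nn.ParameterList is itself a module, so it appears in self.modules().
        self.W = nn.ParameterList([nn.Parameter(torch.randn(3))])

    def forward(self, x):
        w = self.W[0].softmax(-1)
        return w[0] * x  # placeholder use of the weights

m = MyModuleSketch()
print([type(mod).__name__ for mod in m.modules()])
# ['MyModuleSketch', 'ParameterList']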

I understand. With self.W being an nn.Parameter, you can access it via model.named_parameters() or model.parameters().
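A small illustration of that point (again with a hypothetical module): a plain nn.Parameter does not appear in model.modules(), but it does appear in model.parameters() / model.named_parameters():

import torch
import torch.nn as nn

class PlainParamModule(nn.Module):
    # Illustrative only: W is a plain nn.Parameter attribute.
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=1)
        self.W = nn.Parameter(torch.randn(3))

m = PlainParamModule()
print([type(mod).__name__ for mod in m.modules()])
# ['PlainParamModule', 'Conv2d']  -> W is not a module, so it is not listed here
print([name for name, _ in m.named_parameters()])
# ['W', 'conv.weight', 'conv.bias']  -> but it is visible as a parameter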
