Propagate gradient to parameters not directly involved in computation

Hi,

I am designing an NN layer that uses a function specified by a set of parameters. These parameters, which should be updated during backprop, are not directly involved in the logits computation. Hence, they are not updated out of the box.

Is there a way to pass a gradient through them explicitly?

You can assign the gradients to parameters directly via:

param.grad = new_grad

and the optimizer will use this .grad attribute to update the param.
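
For illustration, here is a minimal sketch of this pattern (with made-up values):

import torch

# manually assign a gradient, then let the optimizer consume it
param = torch.nn.Parameter(torch.zeros(3))
optimizer = torch.optim.SGD([param], lr=0.1)

param.grad = torch.tensor([1.0, 2.0, 3.0])  # i.e. param.grad = new_grad
optimizer.step()                            # the update reads param.grad
print(param)  # tensor([-0.1000, -0.2000, -0.3000], requires_grad=True)
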
However, the more interesting question might be how these new gradients are calculated if this parameter was never used.
Could you describe your use case a bit more, and in particular why the parameter is never used but should still be updated? (If it's never used, it would not have any effect on the model output and thus on the loss.)


Hi,

Thanks for your interest. 🙂

In general, I would like to create a layer that applies a wavelet transformation to the layer input. Hence, I thought it would be a good idea to make the wavelet filter values the layer's parameters. However, once the wavelet is computed from the filters (which requires numpy objects), the “connection” between the logits and the params is lost.

Perhaps there is a better way to achieve that (even with a canonical 1D convolution), but I thought that creating a layer myself would be a good exercise. 🙃

I guess you are detaching the computed parameter by wrapping and assigning it via:

param = calculate_param_in_differentiable_way(...)
self.param = nn.Parameter(param)

If so, try to use the calculated param directly via the functional API. E.g. instead of assigning it to the .weight parameter of a layer and calling the layer via out = layer(x), use out = F.conv2d(input, param) (if you are using nn.Conv2d as the layer).
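
A rough sketch of that approach (the shapes and the way the kernel is derived are just illustrative, not taken from your code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class FilterConv(nn.Module):
    def __init__(self):
        super().__init__()
        # trainable raw filter values
        self.raw_filters = nn.Parameter(torch.randn(4, 3, 3, 3))

    def forward(self, x):
        # derive the actual kernel with torch ops only, so the computation
        # graph reaches self.raw_filters
        kernel = self.raw_filters / self.raw_filters.norm()
        # use the functional API instead of assigning to a Conv2d's .weight
        return F.conv2d(x, kernel, padding=1)

layer = FilterConv()
out = layer(torch.randn(2, 3, 8, 8))
out.sum().backward()
print(layer.raw_filters.grad.shape)  # torch.Size([4, 3, 3, 3])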


Hi Mateusz!

Let me try to clarify what you are asking, as I am not certain I understand.

When you say “wavelet,” do you mean the “wavelet filter values” that form
the “layer parameters” of your wavelet-transformation layer? Or is your
“wavelet” the output of your wavelet-transformation layer?

I am guessing the second. That is, you create a layer whose trainable
parameters are “wavelet filter values” that are packaged as Parameters.
This layer takes some input and computes from it some output (that you
call a wavelet) and this output depends on the trainable wavelet filter values.
The problem is that in order to compute the layer’s output (the “wavelet”)
you require “numpy objects” which I assume means that you use numpy
to perform part of the layer’s computation.

If this is the case, then the “connection” between the layer’s output and the
layer’s trainable parameters (as well as the layer’s input, if that matters) will,
indeed, be lost. (This “connection” is often referred to as the “computation
graph.”)

This is because when you perform pytorch tensor operations (on a tensor
that has requires_grad = True), the tensor operations perform some
bookkeeping and possibly save some intermediate results that are used
by the autograd framework to perform backpropagation. Numpy knows
nothing about autograd and doesn’t perform any of the bookkeeping needed
for backpropagation.
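
As a small illustration (a made-up computation, not your wavelet code), routing even a single step through numpy leaves the result without any autograd history:

import torch

w = torch.randn(3, requires_grad=True)

y_torch = (2.0 * w).sum()
print(y_torch.grad_fn)        # <SumBackward0 object at ...> -- the op was recorded

y_numpy = torch.as_tensor(2.0 * w.detach().numpy()).sum()
print(y_numpy.grad_fn)        # None -- the numpy step broke the graph
print(y_numpy.requires_grad)  # False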

There are two ways to address this issue:

First, you could rewrite your layer to use pure pytorch tensor operations,
replacing any computations performed by numpy with equivalent pytorch
tensor computations. If you use pytorch for everything, the computation
graph for the whole computation performed by your layer will be created,
and you will be able to backpropagate from the layer’s output to the layer’s
trainable parameters (as well as to the layer’s input, if that matters).

Second, you could wrap the computations performed in numpy in a
custom autograd function. However, you then have to write a backward()
method for your custom Function that, roughly speaking, computes the
gradient of your Function’s forward() method.

So the practical question is whether it’s easier to rewrite your layer’s forward
pass with pure pytorch tensor operations, leaving out numpy, or whether it’s
easier to implement your layer’s backward pass (“gradient”) “by hand” (where
you would be allowed to use numpy or other tools to implement part of the
backward-pass computation).
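
For concreteness, here is a minimal sketch of the second option, where the “numpy part” is just an elementwise square so that the hand-written backward() is easy:

import numpy as np
import torch

class NumpySquare(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # the part done outside of pytorch -- no autograd bookkeeping here
        y = np.square(x.detach().cpu().numpy())
        return torch.from_numpy(y).to(x.device)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # hand-written gradient of forward(): d(x**2)/dx = 2 * x
        return 2.0 * x * grad_output

x = torch.randn(5, requires_grad=True)
NumpySquare.apply(x).sum().backward()
print(torch.allclose(x.grad, 2.0 * x))  # True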

Best.

K. Frank


Thank you for your responses. What I was doing was not correct at all. I wanted to create a wavelet, storing the low-pass and high-pass filter values as parameters. However, they were only used to create a wavelet object from an external library, so the parameters were not used at all in the forward pass (well, not directly).

Now I’ve rewritten the whole thing to operate on tensors, without any libraries external to PyTorch. Here is the definition of my layer:

class WaveletLayer(nn.Module):
    
    def __init__(self, input_size: int, filter_size: int = 2, wavelet_name: str = None, scale: float=math.sqrt(0.5)):
        super().__init__()
        self.input_size = input_size
        self.scale = scale
        
        if not wavelet_name:
            low_pass, high_pass = self.initialize_random(filter_size=filter_size)
        else:
            low_pass, high_pass = self.initialize_from_existing(name=wavelet_name)
        
        self.low_pass = nn.Parameter(torch.Tensor(low_pass.shape))
        self.high_pass = nn.Parameter(torch.Tensor(high_pass.shape))
        
        self.low_pass.data = low_pass
        self.high_pass.data = high_pass
        
        self.low_pass_out_size = self.input_size + len(self.low_pass) - 1
        self.high_pass_out_size = self.input_size + len(self.high_pass) - 1
        
    def initialize_from_existing(self, name: str):
        w = pywt.Wavelet(name)
        low_pass = torch.Tensor(w.dec_lo)
        high_pass = torch.Tensor(w.dec_hi)
        return low_pass, high_pass
    
    def initialize_random(self, filter_size: int):
        low_pass = torch.Tensor(filter_size)
        high_pass = torch.Tensor(filter_size)
        
        low_pass = nn.init.uniform_(low_pass, a=-1, b=1)
        high_pass = nn.init.uniform_(high_pass, a=-1, b=1)
        return low_pass, high_pass

    def convolve(self, x: torch.Tensor, kernel: torch.Tensor):
        kernel = torch.flip(kernel, [0])
        k_l = len(kernel)
        pad_size = k_l - 1
        
        p_x = F.pad(x, (pad_size, pad_size), value=0)
        result = p_x.clone()[...,:-pad_size]
        for bidx in range(result.shape[0]):
            for i in range(len(p_x)-pad_size):
                result[bidx, i] = torch.dot(p_x[i:i+k_l], kernel)
        return result
    
    def forward(self, x: torch.Tensor):
        y_low = self.convolve(x, self.low_pass) * self.scale
        y_high = self.convolve(x, self.high_pass) * self.scale
        return y_low, y_high

The model itself:

class WaveletNet(nn.Module):
    def __init__(self, input_size: int):
        super().__init__()
        
        self.wavelet_layer = WaveletLayer(input_size=input_size, filter_size=8)
        self.a_fc = nn.Linear(self.wavelet_layer.low_pass_out_size, 1000)
        self.d_fc = nn.Linear(self.wavelet_layer.high_pass_out_size, 1000)
        self.dense = nn.Linear(2000, 512)
        self.output = nn.Linear(512, 1)
        
        self.activation = nn.ReLU()
        
    def forward(self, x):
        cA, cD = self.wavelet_layer(x)
        
        cA = self.activation(cA)
        cD = self.activation(cD)
    
        x_cA = self.a_fc(cA)
        x_cD = self.d_fc(cD)
        
        x = torch.cat([x_cA, x_cD], dim=1)

        x = self.activation(x)
        
        x = self.dense(x)
        x = self.activation(x)

        x = self.output(x)

        return x

Before training and after a few epochs I printed some information about the low-pass filter and one of the “canonical” layers to check whether the gradients are being computed. They are not, but only for my layer.

Before training:

Low pass: Parameter containing:
tensor([-0.0106,  0.0329,  0.0308, -0.1870, -0.0280,  0.6309,  0.7148,  0.2304],
       requires_grad=True)
Low pass grads: None
a_fc: Parameter containing:
tensor([[-9.3954e-04, -1.8090e-03,  7.7439e-04,  ..., -2.5442e-05,
          2.1742e-04, -7.0910e-04],
        [-1.7179e-03,  6.8478e-04, -1.3824e-04,  ..., -1.6424e-03,
          1.8740e-03, -8.2803e-04],
        [ 2.6315e-04, -2.4171e-04,  8.3137e-04,  ..., -1.2487e-03,
          1.1330e-03, -1.0383e-03],
        ...,
        [-1.1402e-03,  3.6352e-04, -7.1407e-04,  ..., -1.1429e-03,
          7.8917e-04,  1.8379e-03],
        [-1.9176e-03,  7.7521e-04, -8.4419e-04,  ..., -1.6644e-03,
          1.7614e-03,  2.7056e-04],
        [ 1.2752e-03, -1.0973e-03,  1.5253e-03,  ..., -8.3440e-04,
         -2.0271e-03,  1.6691e-04]], requires_grad=True)
a_fc grad: None

After 3 epochs:

Low pass: Parameter containing:
tensor([-0.0106,  0.0329,  0.0308, -0.1870, -0.0280,  0.6309,  0.7148,  0.2304],
       requires_grad=True)
Low pass grads: None
a_fc: Parameter containing:
tensor([[ 1.8012e-03, -6.6522e-04, -1.7894e-03,  ...,  9.2923e-04,
          1.5849e-03,  1.6501e-03],
        [-1.7078e-03, -7.6538e-04, -3.7305e-04,  ...,  7.5051e-04,
         -1.5665e-05, -1.8910e-03],
        [ 3.6494e-04,  3.6088e-04,  1.8865e-03,  ...,  1.6792e-03,
          1.6700e-03,  5.9159e-04],
        ...,
        [-8.3883e-04, -2.6049e-04,  9.8632e-04,  ..., -2.0582e-03,
         -9.1509e-04, -2.3793e-04],
        [-4.5228e-05,  1.7284e-03, -1.0506e-04,  ...,  1.9384e-03,
         -1.8192e-03, -2.8353e-05],
        [ 5.2225e-04, -1.3121e-03, -1.0122e-03,  ..., -1.5334e-03,
          1.9817e-03,  5.8947e-04]], requires_grad=True)
a_fc grad: tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])

How can one debug why the gradients are not assigned to the low_pass and high_pass parameters of the WaveletLayer?

Hi Mateusz!

Please post a simplified, fully-self-contained, runnable script that reproduces
your issue together with its output.

Best.

K. Frank

Hi @KFrank,

Please find below the script content with the output.

Content:

import math


import torch
import torch.nn as nn
import torch.nn.functional as F


class WaveletLayer(nn.Module):
    
    def __init__(self, input_size: int, filter_size: int = 2, scale: float=math.sqrt(0.5)):
        super().__init__()
        self.input_size = input_size
        self.scale = scale
        
        low_pass, high_pass = self.initialize_random(filter_size=filter_size)
        
        self.low_pass = nn.Parameter(torch.Tensor(low_pass.shape))
        self.high_pass = nn.Parameter(torch.Tensor(high_pass.shape))
        
        self.low_pass.data = low_pass
        self.high_pass.data = high_pass
        
        self.low_pass_out_size = self.input_size + len(self.low_pass) - 1
        self.high_pass_out_size = self.input_size + len(self.high_pass) - 1
    
    def initialize_random(self, filter_size: int):
        low_pass = torch.Tensor(filter_size)
        high_pass = torch.Tensor(filter_size)
        
        low_pass = nn.init.uniform_(low_pass, a=-1, b=1)
        high_pass = nn.init.uniform_(high_pass, a=-1, b=1)
        return low_pass, high_pass

    def convolve(self, x: torch.Tensor, kernel: torch.Tensor):
        kernel = torch.flip(kernel, [0])
        k_l = len(kernel)
        pad_size = k_l - 1
        
        p_x = F.pad(x, (pad_size, pad_size), value=0)
        result = p_x.clone()[...,:-pad_size]
        for bidx in range(result.shape[0]):
            for i in range(len(p_x)-pad_size):
                result[bidx, i] = torch.sum(p_x[i:i+k_l] * kernel)
                # result[bidx, i] = torch.dot(p_x[i:i+k_l], kernel)
        return result
    
    def forward(self, x: torch.Tensor):
        y_low = self.convolve(x, self.low_pass) * self.scale
        y_high = self.convolve(x, self.high_pass) * self.scale
        return y_low, y_high


class WaveletNet(nn.Module):
    def __init__(self, input_size: int):
        super().__init__()
        
        self.wavelet_layer = WaveletLayer(input_size=input_size, filter_size=8)
        self.a_fc = nn.Linear(self.wavelet_layer.low_pass_out_size, 1000)
        self.d_fc = nn.Linear(self.wavelet_layer.high_pass_out_size, 1000)
        self.dense = nn.Linear(2000, 512)
        self.output = nn.Linear(512, 1)
        
        self.activation = nn.ReLU()
        
    def forward(self, x):
        cA, cD = self.wavelet_layer(x)
        
        cA = self.activation(cA)
        cD = self.activation(cD)
    
        x_cA = self.a_fc(cA)
        x_cD = self.d_fc(cD)
        
        x = torch.cat([x_cA, x_cD], dim=1)

        x = self.activation(x)
        
        x = self.dense(x)
        x = self.activation(x)

        x = self.output(x)

        return x

def print_model_info(model):
    print(f'Low pass params: {model.wavelet_layer.low_pass}')
    print(f'Low pass grads: {model.wavelet_layer.low_pass.grad}')
    print(f'High pass params: {model.wavelet_layer.high_pass}')
    print(f'High pass grads: {model.wavelet_layer.high_pass.grad}')
    print(f'a_fc: {model.a_fc.weight}')
    print(f'a_fc grad: {model.a_fc.weight.grad}\n\n\n')

if __name__ == '__main__':
    wn = WaveletNet(input_size=100)
    batch = torch.randn(4, 100)
    labels = torch.zeros(4, 1)
    labels[1,0] = 1
    labels[2,0] = 1
    optimizer = torch.optim.Adam(params=wn.parameters(), lr=1e-3)
    print_model_info(model=wn)
    
    # forward pass
    print('\n\n\nForward pass...')
    y_hat = wn(batch)
    
    # backward pass
    print('Backward pass...\n\n\n')
    loss = ((y_hat - labels)**2).sum()
    loss.backward()
    
    print_model_info(model=wn)
    
    # weights update
    print('\n\n\nOptimizer step...\n\n\n')
    optimizer.step()
    
    print_model_info(model=wn)
    
    

Output:

Low pass params: Parameter containing:
tensor([ 0.3232, -0.0624,  0.6717,  0.7571, -0.5782,  0.9844, -0.8107,  0.6903],
       requires_grad=True)
Low pass grads: None
High pass params: Parameter containing:
tensor([ 0.6047,  0.6385,  0.3769,  0.3976, -0.4900,  0.4423,  0.6187,  0.3618],
       requires_grad=True)
High pass grads: None
a_fc: Parameter containing:
tensor([[-0.0884, -0.0653, -0.0750,  ...,  0.0735, -0.0902, -0.0843],
        [-0.0270,  0.0149,  0.0375,  ..., -0.0943,  0.0910, -0.0290],
        [ 0.0018, -0.0531, -0.0136,  ...,  0.0863,  0.0759, -0.0533],
        ...,
        [ 0.0474, -0.0530, -0.0135,  ..., -0.0572, -0.0218, -0.0203],
        [-0.0664, -0.0240,  0.0462,  ...,  0.0055,  0.0155, -0.0033],
        [-0.0453, -0.0809,  0.0781,  ..., -0.0905,  0.0843,  0.0110]],
       requires_grad=True)
a_fc grad: None


Forward pass...
Backward pass...


Low pass params: Parameter containing:
tensor([ 0.3232, -0.0624,  0.6717,  0.7571, -0.5782,  0.9844, -0.8107,  0.6903],
       requires_grad=True)
Low pass grads: None
High pass params: Parameter containing:
tensor([ 0.6047,  0.6385,  0.3769,  0.3976, -0.4900,  0.4423,  0.6187,  0.3618],
       requires_grad=True)
High pass grads: None
a_fc: Parameter containing:
tensor([[-0.0884, -0.0653, -0.0750,  ...,  0.0735, -0.0902, -0.0843],
        [-0.0270,  0.0149,  0.0375,  ..., -0.0943,  0.0910, -0.0290],
        [ 0.0018, -0.0531, -0.0136,  ...,  0.0863,  0.0759, -0.0533],
        ...,
        [ 0.0474, -0.0530, -0.0135,  ..., -0.0572, -0.0218, -0.0203],
        [-0.0664, -0.0240,  0.0462,  ...,  0.0055,  0.0155, -0.0033],
        [-0.0453, -0.0809,  0.0781,  ..., -0.0905,  0.0843,  0.0110]],
       requires_grad=True)
a_fc grad: tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  5.2062e-03,
          0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ..., -4.9044e-03,
         -6.5578e-06, -4.0527e-06],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  1.1573e-03,
          7.1188e-05,  2.0312e-04],
        ...,
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          2.0860e-05,  5.9520e-05],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
         -2.9760e-05, -8.4916e-05]])


Optimizer step...


Low pass params: Parameter containing:
tensor([ 0.3232, -0.0624,  0.6717,  0.7571, -0.5782,  0.9844, -0.8107,  0.6903],
       requires_grad=True)
Low pass grads: None
High pass params: Parameter containing:
tensor([ 0.6047,  0.6385,  0.3769,  0.3976, -0.4900,  0.4423,  0.6187,  0.3618],
       requires_grad=True)
High pass grads: None
a_fc: Parameter containing:
tensor([[-0.0884, -0.0653, -0.0750,  ...,  0.0725, -0.0902, -0.0843],
        [-0.0270,  0.0149,  0.0375,  ..., -0.0933,  0.0920, -0.0280],
        [ 0.0018, -0.0531, -0.0136,  ...,  0.0853,  0.0749, -0.0543],
        ...,
        [ 0.0474, -0.0530, -0.0135,  ..., -0.0572, -0.0228, -0.0213],
        [-0.0664, -0.0240,  0.0462,  ...,  0.0055,  0.0155, -0.0033],
        [-0.0453, -0.0809,  0.0781,  ..., -0.0905,  0.0853,  0.0120]],
       requires_grad=True)
a_fc grad: tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  5.2062e-03,
          0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ..., -4.9044e-03,
         -6.5578e-06, -4.0527e-06],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  1.1573e-03,
          7.1188e-05,  2.0312e-04],
        ...,
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          2.0860e-05,  5.9520e-05],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
         -2.9760e-05, -8.4916e-05]])

Hi Mateusz!

It turns out that pad_size is greater than len(p_x), so your for i loop tries to
loop over a range() with a negative argument, and the body of the loop is never
executed. kernel (which is basically WaveletLayer.low_pass and
WaveletLayer.high_pass) never gets assigned into result, so low_pass
and high_pass never become part of the computation graph. Therefore,
when you finally call loss.backward(), autograd never assigns a .grad to
them (and optimizer.step() doesn’t change their values).

If you add a print statement to convolve():

        for bidx in range(result.shape[0]):
            print ('len(p_x)-pad_size =', len(p_x)-pad_size)   # this is negative so loop never runs
            for i in range(len(p_x)-pad_size):
                result[bidx, i] = torch.sum(p_x[i:i+k_l] * kernel)

here’s what you get:

len(p_x)-pad_size = -3
len(p_x)-pad_size = -3
len(p_x)-pad_size = -3
len(p_x)-pad_size = -3

I don’t really understand the logic of your convolve() and its for loops, so
I can’t offer a fix, but this is the cause of the issue you’re seeing.

Best.

K. Frank


Hi KFrank!

That was it! I introduced the bug when I forgot to switch from unbatched to batched input, so len(p_x) was not giving me what I expected. After switching to p_x.shape[-1] I got what I wanted. 🙂
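
For completeness, the corrected convolve() now looks roughly like this (a sketch reconstructed from the fix described above, the exact final code may differ slightly):

    def convolve(self, x: torch.Tensor, kernel: torch.Tensor):
        kernel = torch.flip(kernel, [0])
        k_l = len(kernel)
        pad_size = k_l - 1

        p_x = F.pad(x, (pad_size, pad_size), value=0)
        result = p_x.clone()[..., :-pad_size]
        for bidx in range(result.shape[0]):
            # loop over the length dimension, not the batch dimension
            for i in range(p_x.shape[-1] - pad_size):
                # and slice the current sample explicitly
                result[bidx, i] = torch.sum(p_x[bidx, i:i + k_l] * kernel)
        return result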

Thank you!