Constant loss function problem

Hello,
In my model I perform the following matrix multiplication in the forward pass:

new_D = ((torch.conj(M.T) @ self.W) / torch.linalg.norm(torch.conj(M.T) @ self.W, dim=0)).type(torch.complex128)

I then pass new_D through the layers, and W is the only learnable parameter. However, I keep getting the same loss.

Could you post a minimal and executable code snippet demonstrating this behavior, please?

        
    def __init__(self, W_init: Tensor, normalize: bool = True) -> None:
        # W shape: [N, A]
        super().__init__()

        if normalize:
            # Initialization of the weights with column normalization
            self.W = nn.Parameter(torch.tensor(W_init / np.linalg.norm(W_init, axis=0), dtype=torch.complex128))
        else:
            # Initialization of the weights without normalization
            self.W = nn.Parameter(torch.tensor(W_init, dtype=torch.complex128))

    def forward(self, x: Tensor, M: Tensor, k: int, sigma: Optional[float] = None, sc: int = 2) -> Tuple[Tensor, Tensor, Optional[np.ndarray]]:
        residual = x.clone()

        new_D = ((torch.conj(M.T) @ self.W) / torch.linalg.norm(torch.conj(M.T) @ self.W, dim=0)).type(torch.complex128)

Then I use this new_D in the forward pass. Note that M changes from one batch to the next because I am performing online training, while self.W stays the same, so the cost function should decrease as I update the weights. But here are the results:

Batch 0  cost function: 0.39710514336578767 
Batch 1  cost function: 0.3441645648279276 
Batch 2  cost function: 0.42096891700141115 
Batch 3  cost function: 0.39754175134558895 
Batch 4  cost function: 0.38801711819081297 
Batch 5  cost function: 0.40680617950575526 
Batch 6  cost function: 0.40031897748444123 
Batch 7  cost function: 0.3503572570090402 
Batch 8  cost function: 0.35948698372772214 

Your code snippet is not executable so I added some missing parts and it works fine for me:

class MyModule(nn.Module):
    def __init__(self, W_init: Tensor, normalize: bool = True) -> None:
        super().__init__()
        if normalize:
            self.W = nn.Parameter(torch.tensor(W_init / np.linalg.norm(W_init, axis=0), dtype=torch.complex128))
        else:
            self.W = nn.Parameter(torch.tensor(W_init, dtype=torch.complex128))

    def forward(self, M: Tensor) -> Tensor:
        new_D = ((torch.conj(M.T) @ self.W) / torch.linalg.norm(torch.conj(M.T) @ self.W, dim=0)).type(torch.complex128)
        out = new_D.mean()
        return out
    
w_init = torch.randn(10, 10)
module = MyModule(w_init)

x = torch.randn(10, 10).to(torch.complex128)
out = module(x)

out.norm().backward()
print(module.W.grad)
# tensor([[ 1.0674e-02+0.j, -1.3424e-02+0.j,  6.4672e-03+0.j,  7.6663e-03+0.j,
#           8.6813e-04+0.j,  5.5113e-03+0.j,  5.5467e-03+0.j,  1.6664e-03+0.j,
#           1.0426e-02+0.j,  8.5490e-03+0.j],
# ...


w_init = torch.randn(10, 10)
module = MyModule(w_init, normalize=True)

x = torch.randn(10, 10).to(torch.complex128)
out = module(x)

out.norm().backward()
print(module.W.grad)
# tensor([[-1.0415e-02+0.j, -2.1933e-02+0.j, -8.0953e-03+0.j, -5.2337e-03+0.j,
#          -1.2110e-02+0.j, -1.7197e-02+0.j, -1.5394e-02+0.j, -3.2122e-02+0.j,
#          -2.5804e-02+0.j, -8.1433e-03+0.j],
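
As a follow-up sanity check, here is a minimal sketch (reusing the MyModule above; the Adam optimizer and learning rate are arbitrary choices for this check, not taken from your setup) showing that the parameter also moves once an optimizer step is applied after backward():

import torch
import torch.optim as optim

w_init = torch.randn(10, 10)
module = MyModule(w_init)
optimizer = optim.Adam(module.parameters(), lr=0.01)  # arbitrary optimizer/lr for the check

x = torch.randn(10, 10).to(torch.complex128)
W_before = module.W.detach().clone()

optimizer.zero_grad()
out = module(x)
out.norm().backward()   # gradients are computed here
optimizer.step()        # the step uses the freshly computed gradients

print(torch.linalg.norm(module.W.detach() - W_before))  # nonzero: W was updated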

Yes, the gradient is calculated, but the issue is that when I train over many batches, the cost remains relatively stable. The training output ends up like this:

Batch 0 cost function: 0.36964430249205493 (this is the cost over the test data)
cost train tensor(0.3855, dtype=torch.float64)
cost train tensor(0.3736, dtype=torch.float64)
Batch 1  cost function: 0.3695768928982403 
cost train tensor(0.3781, dtype=torch.float64)
Batch 2 cost function: 0.3695796137680072 

...
cost train tensor(0.3876, dtype=torch.float64)
Batch 45 cost function: 0.3698434706092006 

Otherwise, if I do it like this, where my input is x @ torch.conj(M) and the weights are multiplied by the same M, does this affect the backpropagation? As I said, I am performing online learning. Thank you in advance!

class MyModule(nn.Module):
    def __init__(self, W_init: Tensor, normalize: bool = True) -> None:
        super().__init__()
        if normalize:
            self.W = nn.Parameter(torch.tensor(W_init / np.linalg.norm(W_init, axis=0), dtype=torch.complex128))
        else:
            self.W = nn.Parameter(torch.tensor(W_init, dtype=torch.complex128))

    def forward(self, M: Tensor, x: Tensor) -> Tensor:
        new_D = ((torch.conj(M.T) @ self.W) / torch.linalg.norm(torch.conj(M.T) @ self.W, dim=0)).type(torch.complex128)
        out = (new_D @ x).mean()
        return out

w_init = torch.randn(2, 2)

module = MyModule(w_init)
optimizer = optim.Adam(module.parameters(), lr=0.01, weight_decay=0.5)

for i in range(10):
    optimizer.zero_grad()
    x = torch.randn(2, 2).to(torch.complex128)
    M = torch.randn(2, 2).to(torch.complex128)
    input = x @ M
    out = module(input, M)
    optimizer.step()

    out.norm().backward()

    '''for name, param in module.named_parameters():
        # Check whether the parameter has a non-zero gradient
        if param.grad is not None:
            print(f'Parameter: {name}, Gradient: \n{param.grad}')
        else:
            print(f'Parameter: {name}, No gradient computed')'''
    print('out', out.norm())

Every operation using a trainable parameter will be tracked by Autograd and will influence the gradients. If you are using self.W in any operation, this parameter will receive gradients and will then be updated, assuming it was passed to an optimizer.
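
As a rough illustration of this (arbitrary small shapes, not your actual model), any operation that touches self.W gives it a gradient:

import torch
import torch.nn as nn

W = nn.Parameter(torch.randn(3, 3, dtype=torch.complex128))
M = torch.randn(3, 3, dtype=torch.complex128)

# Same kind of column-normalized product as in your forward
D = (M.conj().T @ W) / torch.linalg.norm(M.conj().T @ W, dim=0)
loss = D.abs().mean()  # any real-valued scalar works for backward()
loss.backward()

print(W.grad is not None)         # True: W received a gradient
print(torch.linalg.norm(W.grad))  # nonzero in general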

Yes, I appreciate the explanation.
However, in my scenario I keep running into the same issue: in each batch I multiply by a different matrix M, and I multiply the weights by that same M, yet the cost function either remains constant or decreases only slightly.

Batch 0 cost function: 0.17308901803227464
Batch 1 cost function: 0.17007461108137648
Batch 2 cost function: 0.1719049257029129
Batch 3 cost function: 0.170692183723448
Batch 4 cost function: 0.1685383030535764

Batch 9 cost function: 0.15578051793514183

Batch 11 cost function: 0.15963812362716534
Batch 12 cost function: 0.16094831311693214

Even with a learning rate of 0.1, it stays almost constant.
Do you have any insights into why this might be happening?
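
For reference, here is a rough sketch (using the second MyModule above, with placeholder shapes and a simple scalar cost instead of my real one) of the kind of diagnostic loop I could run to log the gradient norm and how much W moves per batch:

import torch
import torch.optim as optim

module = MyModule(torch.randn(8, 8))                 # second MyModule above, forward(M, x)
optimizer = optim.Adam(module.parameters(), lr=0.1)  # same high learning rate I tried

for batch in range(5):
    x = torch.randn(8, 8, dtype=torch.complex128)
    M = torch.randn(8, 8, dtype=torch.complex128)

    optimizer.zero_grad()
    out = module(M, x @ torch.conj(M))   # input multiplied by conj(M), as described
    cost = out.abs()                     # placeholder scalar cost, just for the diagnostic
    cost.backward()

    W_before = module.W.detach().clone()
    optimizer.step()

    grad_norm = torch.linalg.norm(module.W.grad)
    step_size = torch.linalg.norm(module.W.detach() - W_before)
    print(f'batch {batch}: grad norm {grad_norm:.3e}, change in W {step_size:.3e}')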