Question from a novice: my neural network loss does not converge

I am trying to train a model. Before my last modification the loss kept increasing; after the modification it still does not converge.

This is my model:

class u_seta(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        # Simple MLP: input_dim -> 256 -> 512 -> 256 -> output_dim, PReLU activations
        self.block = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.PReLU(),
            nn.Linear(256, 512),
            nn.PReLU(),
            nn.Linear(512, 256),
            nn.PReLU(),
            nn.Linear(256, output_dim),
        )

    def forward(self, x):
        return self.block(x)

I am using the Adam optimizer:

u_setaNet = utils.KRU_utils.u_seta(75,72).to(device)
optimizer = torch.optim.Adam(u_setaNet.parameters(),lr=0000000000.1)

Training loop:

for now_step in range(n + 1):
    if now_step == 0:
        q_i_est = data_3D_pose.clone().to(device)
        X_est = utils.KRU_utils.pose_to_3d(q_i_est, data_3D).to(device)
        x_est = utils.KRU_utils.joint3d_to_2d(X_est, data_3D).to(device)
        Loss = utils.KRU_utils.LossF(x_est, data_2D_keypoint).to(device)
        Loss = Loss.sum(dim=1)
        X_est = torch.autograd.grad(outputs=Loss, inputs=X_est,
                                    grad_outputs=torch.ones_like(Loss))[0]
        X_est = X_est.reshape(X_est.shape[0], -1).to(device)
        Loss = Loss.detach()

    else:
        q_i_est = q_i_est.detach()
        X_est.requires_grad = True

        q_i_est_alter = u_setaNet(X_est)
        q_i_est = q_i_est + q_i_est_alter

        X_est = utils.KRU_utils.pose_to_3d(q_i_est, data_3D).to(device)
        x_est = utils.KRU_utils.joint3d_to_2d(X_est, data_3D).to(device)

        Loss = utils.KRU_utils.LossF(x_est, data_2D_keypoint).to(device)
        Loss = Loss.sum(dim=1)

        optimizer.zero_grad()
        X_est = torch.autograd.grad(outputs=Loss, inputs=X_est,
                                    grad_outputs=torch.ones_like(Loss))[0]
        optimizer.step()

        X_est = X_est.reshape(X_est.shape[0], -1).to(device)
        Loss = Loss.detach()

The loss function is:

def LossF(joint2d, data_2D_keypoint):
    # Per-joint Euclidean distance between predicted and target 2D keypoints,
    # weighted by the keypoint confidence stored in channel 2
    diff = joint2d[:, :, 0:2] - data_2D_keypoint[:, :, 0:2]
    dist = torch.sqrt(diff[:, :, 0] ** 2 + diff[:, :, 1] ** 2)
    return dist * data_2D_keypoint[:, :, 2]
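
(For reference, this is a confidence-weighted per-joint Euclidean distance; I believe an equivalent one-liner using torch.linalg.norm is:)

dist = torch.linalg.norm(joint2d[:, :, :2] - data_2D_keypoint[:, :, :2], dim=-1)
Loss = dist * data_2D_keypoint[:, :, 2]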

A series of transformations is performed after the network output; will this cause problems during training?

How should I modify it? Please help me find the problem. :sob:

I am a bit confused by lr=0000000000.1: are you intending to lower the learning rate (which would make sense as a first thing to try when a model is not converging)? I believe 0000000000.1 is simply parsed as 0.1, so you may want to try e.g. 0.01, 0.001, or scientific notation such as 1e-4.
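
For what it's worth, Python float literals may contain leading zeros, so 0000000000.1 really is just 0.1:

>>> 0000000000.1
0.1
>>> 0000000000.1 == 0.1
True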


Thank you for your answer!
Indeed, I intended to reduce the learning rate; I meant to write 0.0000000001.

lr=0.1:
0 1 tensor(3062.5969, device='cuda:0')
0 1 tensor(2638.7090, device='cuda:0')
0 1 tensor(2935.3193, device='cuda:0')
0 1 tensor(2183.3059, device='cuda:0')
0 1 tensor(2613.2559, device='cuda:0')
0 1 tensor(2317.4438, device='cuda:0')

lr=1e-4:
0 1 tensor(1728.2727, device='cuda:0')
0 1 tensor(1882.7476, device='cuda:0')
0 1 tensor(1797.2104, device='cuda:0')
0 1 tensor(1854.1182, device='cuda:0')
0 1 tensor(2128.3904, device='cuda:0')
0 1 tensor(2666.2241, device='cuda:0')

lr=0.0000000001:
0 1 tensor(2332.1382, device='cuda:0')
0 1 tensor(2551.4082, device='cuda:0')
0 1 tensor(2360.8560, device='cuda:0')
0 1 tensor(2940.4958, device='cuda:0')
0 1 tensor(3134.4497, device='cuda:0')
0 1 tensor(2796.6272, device='cuda:0')

Naive question: I do not see a loss.backward() call in your code. Are you calling it anywhere else?
I am sure you might already know this.
The usual procedure is:

--- model forward ---
--- loss calculation ---

optimizer.zero_grad()
loss.backward()
optimizer.step()
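
For concreteness, a minimal runnable sketch of that pattern (with a made-up model and random data, just to show the call order):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 10), torch.randn(32, 1)

for _ in range(100):
    pred = model(x)                         # model forward
    loss = nn.functional.mse_loss(pred, y)  # loss calculation
    optimizer.zero_grad()                   # clear old gradients
    loss.backward()                         # populate .grad on the model's parameters
    optimizer.step()                        # update the parameters using .grad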

Thank you for your answer!
I am using:

torch.autograd.grad(outputs=Loss,inputs=X_est,grad_outputs=torch.ones_like(Loss))

I have previously tried using

loss.backward()

But X_est does not seem to receive a gradient.
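
A minimal example of what I mean (made-up tensors, not my real code):

import torch

x = torch.randn(3, requires_grad=True)  # leaf tensor
y = x * 2                               # non-leaf tensor
y.sum().backward()
print(x.grad)  # tensor([2., 2., 2.])
print(y.grad)  # None, plus a UserWarning that .grad of a non-leaf tensor is not populated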

That’s an interesting point as I also don’t fully understand your current use case.
It seems you are using a model called u_setaNet and are initializing the optimizer with its parameters:

u_setaNet = utils.KRU_utils.u_seta(75,72).to(device)
optimizer = torch.optim.Adam(u_setaNet.parameters(),lr=0000000000.1)

However, in your “training sequence” code snippet you are only calculating the gradients with inputs=X_est, which is the output of the model instead of the parameters of the model itself.
I guess you are expecting autograd.grad to compute all gradients and accumulate them in the .grad attributes of the parameters, which is not the case. Also, you are overriding X_est with the output of utils.KRU_utils.pose_to_3d so note that this tensor is not the input to the model anymore.

In any case, even if you used autograd.grad(outputs=Loss, inputs=u_setaNet.parameters()), you would still need to apply these gradients, since right now they are only assigned back to X_est without any further use.
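
For illustration, manually applying the gradients returned by autograd.grad could look like this (a sketch, not your actual code):

grads = torch.autograd.grad(outputs=Loss, inputs=list(u_setaNet.parameters()),
                            grad_outputs=torch.ones_like(Loss))
optimizer.zero_grad()
for p, g in zip(u_setaNet.parameters(), grads):
    p.grad = g    # autograd.grad returns the gradients instead of populating .grad
optimizer.step()  # now the optimizer can apply the update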


Your answer has inspired me!
I have modified my code as follows:
(Modified parts are marked with "# Modified here")

for now_step in range(n + 1):
    if now_step == 0:
        q_i_est = data_3D_pose.clone().to(device)
        q_i_est.requires_grad = True  # Modified here

        X_est = utils.KRU_utils.pose_to_3d(q_i_est, data_3D).to(device)
        X_est.retain_grad()  # Modified here

        x_est = utils.KRU_utils.joint3d_to_2d(X_est, data_3D).to(device)
        Loss = utils.KRU_utils.LossF(x_est, data_2D_keypoint).to(device)
        Loss = Loss.sum(dim=1)

        Loss.backward(torch.ones(X_est.shape[0]).to(device))  # Modified here
        X_est = X_est.grad  # Modified here
        X_est = X_est.reshape(X_est.shape[0], -1).to(device)
        Loss = Loss.detach()

    else:
        q_i_est = q_i_est.detach()
        X_est = X_est.detach()  # Modified here
        X_est.requires_grad = True

        q_i_est_alter = u_setaNet(X_est)
        q_i_est = q_i_est + q_i_est_alter

        X_est_again = utils.KRU_utils.pose_to_3d(q_i_est, data_3D).to(device)  # Modified here
        x_est_again = utils.KRU_utils.joint3d_to_2d(X_est_again, data_3D).to(device)  # Modified here

        Loss = utils.KRU_utils.LossF(x_est_again, data_2D_keypoint).to(device)
        Loss = Loss.sum(dim=1)

        optimizer.zero_grad()
        Loss.backward(torch.ones(X_est.shape[0]).to(device))  # Modified here
        optimizer.step()

        X_est = X_est.grad  # Modified here
        X_est = X_est.reshape(X_est.shape[0], -1).to(device)

        loss_txt_cont += Loss.sum().item()
        Loss = Loss.detach()
        print(data_startP, data_endP, Loss.sum())
The result is back to the situation where the loss keeps increasing:

lr=1e-3:
0 1 tensor(2325.0229, device='cuda:0')
0 1 tensor(2532.3550, device='cuda:0')
0 1 tensor(2623.1548, device='cuda:0')
0 1 tensor(2475.8281, device='cuda:0')
0 1 tensor(2532.1895, device='cuda:0')
0 1 tensor(2526.4036, device='cuda:0')

lr=1e-4:
0 1 tensor(2150.0894, device='cuda:0')
0 1 tensor(2583.2695, device='cuda:0')
0 1 tensor(2855.8999, device='cuda:0')
0 1 tensor(2928.6746, device='cuda:0')
0 1 tensor(3010.9243, device='cuda:0')
0 1 tensor(3128.9155, device='cuda:0')

I need to feed ∂loss/∂X_est into the network and take its output as the adjustment to q_i_est, which in turn determines X_est. After that, a series of transformations turns the adjusted pose into values that are compared with the target to compute the loss. Since the network output cannot be compared with the target directly, I want to optimize only the network, not the transformations.
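
One step of what I intend would look roughly like this (a sketch; I am not sure this structure is correct):

grad_X = X_est.grad.detach().reshape(X_est.shape[0], -1)  # ∂loss/∂X_est from the previous step

delta_q = u_setaNet(grad_X)           # network output = adjustment of q_i_est
q_i_est = q_i_est.detach() + delta_q  # apply the adjustment

X_est = utils.KRU_utils.pose_to_3d(q_i_est, data_3D)  # transformations stay in the graph
X_est.retain_grad()                   # so ∂loss/∂X_est is available for the next step
x_est = utils.KRU_utils.joint3d_to_2d(X_est, data_3D)

Loss = utils.KRU_utils.LossF(x_est, data_2D_keypoint).sum()  # scalar loss

optimizer.zero_grad()
Loss.backward()   # gradients flow back through the transformations into u_setaNet
optimizer.step()  # only u_setaNet's parameters are updated, since only they are in the optimizer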
What might generally cause this behavior? :joy:
Calling Loss.backward() and then optimizer.step() should automatically perform gradient descent, right? :thinking: