How to reduce the loss any further in a simple training loop

I will post my simplified code first:

import torch
import torch.nn.functional as F
import numpy as np

def argmax(x, axis=-1):
    # One-hot encoding of the argmax along `axis` (a hard, non-differentiable selection).
    return F.one_hot(torch.argmax(x, dim=axis), x.shape[axis]).float()

loss_function = torch.nn.MSELoss()  # squared error averaged over all elements

def optimization(V, U, E, beta, learning_rate):
    V = V.cuda()
    U = U.cuda()
    E = E.cuda()

    # Optimize a detached copy so the original V is left untouched.
    V_optim = V.detach().clone()
    V_optim.requires_grad = True

    optimizer = torch.optim.Adam([V_optim], lr=learning_rate)

    num_epochs = 200
    for epoch in range(num_epochs):
        optimizer.zero_grad()
        # Greedy one-hot choice over columns given the current value function.
        Q_R = argmax(U + beta * torch.transpose(V_optim, 0, 1))
        # Target: flow payoff of the greedy choice plus the discounted continuation value.
        loss = loss_function(V_optim, torch.matmul(Q_R * U, E) + beta * torch.matmul(Q_R, V_optim))
        loss.backward()  # the graph is rebuilt every epoch, so retain_graph is not needed
        optimizer.step()

        print(f"Epoch {epoch}, Loss: {loss.item()}")

    return V_optim

if __name__ == "__main__":
    n = 100
    beta = 0.98   # discount factor
    alpha = 0.03  # curvature of the production function k ** alpha
    delta = 1     # depreciation rate (full depreciation)

    # Steady-state capital stock and a grid of n capital values around it.
    kss = ((1 / beta - (1 - delta)) / alpha) ** (1 / (alpha - 1))
    k = np.linspace(0.5 * kss, 1.4 * kss, n)

    # c[i, j]: consumption when moving from capital k[i] to k[j]; clip infeasible
    # (negative) values to a tiny positive number before taking the log utility.
    k_reshaped = k.reshape(-1, 1)
    c = k_reshaped ** alpha + (1 - delta) * k_reshaped - k
    c[c < 0] = 1e-11
    c = np.log(c)

    # Initial guess for the value function (steady-state utility in perpetuity),
    # the flow-utility matrix, and a column of ones used to sum each row.
    V = (np.log(kss ** alpha - delta * kss) / (1 - beta)) * torch.ones(n, 1)
    U = torch.tensor(c, dtype=torch.float32)
    E = torch.ones(n, 1)

    learning_rate = 0.002

    optimal_V = optimization(V, U, E, beta, learning_rate)
    print(optimal_V)

With the code above, the (mean) loss after 200 epochs is

    Epoch 199, Loss: 3.335637765999877e-09

But the largest element-wise loss in the last epoch is

    tensor(0.0006, device='cuda:0', grad_fn=<MaxBackward1>)
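
The per-element errors can be inspected with something like the snippet below: a minimal sketch, assuming it runs inside optimization() after the loop so that V_optim, U, E, beta, and argmax are in scope; reduction='none' keeps every squared error instead of averaging.

    with torch.no_grad():
        # Recompute the target once more and keep the squared error of each element.
        Q_R = argmax(U + beta * torch.transpose(V_optim, 0, 1))
        target = torch.matmul(Q_R * U, E) + beta * torch.matmul(Q_R, V_optim)
        per_element = F.mse_loss(V_optim, target, reduction='none')  # shape (n, 1)
        print(per_element.mean().item(), per_element.max().item())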

Based on this simple training setup, I wonder:

(1) Is there any way to reduce the loss further, especially the largest element-wise loss?

(2) Is this a problem with the choice of loss function, or with the format of the inputs?
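
To make (2) concrete, the kind of change I have in mind is, for example, double-precision inputs and/or a loss that also penalizes the single worst element. The sketch below is untested and only illustrates the question; the names U64, E64, and mixed_loss are made up, while c and n come from the code above.

    # Rough, untested sketch to illustrate question (2).
    # `c` and `n` are the NumPy array and grid size from the code above.
    U64 = torch.tensor(c, dtype=torch.float64)      # float64 instead of float32
    E64 = torch.ones(n, 1, dtype=torch.float64)

    def mixed_loss(pred, target, weight=1.0):
        # Mean squared error plus a penalty on the single worst element.
        err = (pred - target) ** 2
        return err.mean() + weight * err.max()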

Many thanks.