The NN model is not converging!

Hi everyone,
My code is below. The validation accuracy and loss are not converging (I am training and validating on the same dataset).
idx_l is a list of length 96.
I really appreciate any help you can provide.

class Net(nn.Module):
    def __init__(self, num_input, num_hidden, num_output, dropout, activation):
        super(Net, self).__init__()
        self.dropout = nn.Dropout(dropout)
        self.fc1 = nn.Linear(num_input, num_hidden)
        self.fc2 = nn.Linear(num_hidden, num_output)

        if activation == 'tanh':
            self.activation_f = torch.tanh
        elif activation == 'relu':
            self.activation_f = torch.relu

    def forward(self, x):
        x = self.activation_f(self.fc1(x))
        x = self.dropout(x)
        x = torch.sigmoid(self.fc2(x))
        return x

model_net = Net(num_input=8, num_hidden=4, num_output=2, dropout=0.0, activation='tanh')

def model(data, label, idx_l, model_net):

    dataset = TensorDataset(data, label)
    data_loader = DataLoader(dataset, batch_size=len(idx_l), shuffle=False, drop_last=True)

    optimizer = torch.optim.SGD(model_net.parameters(), lr=0.0001, momentum=0.9)

    max_nr_batches = 1
    iteration_count = 0
    for batch_idx, (data, label) in enumerate(data_loader): 
        data, label =,
        data = data.clone().detach().requires_grad_(True)
        output= model_net(data)
        loss = loss_fn(pred=output, target=label)
        grad = torch.autograd.grad(outputs=loss, inputs=data)

    return grad, model_net

def validation(data, label, idx_l, model_net):

    dataset = TensorDataset(data, label)
    data_loader = DataLoader(dataset, batch_size=32, shuffle=False, drop_last=True)

    iteration_count = 0
    val_loss = 0.0
    correct = 0
    max_nr_batches = 3
    with torch.no_grad():
        for batch_idx, (data, label) in enumerate(data_loader): 
            data, label =,
            output= model_net(data)
            val_loss += loss_fn(output, label).item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(label.view_as(pred)).sum().item()
            if iteration_count >= max_nr_batches >= 0:
                break
    val_loss /= len(data_loader.dataset)
    accuracy = 100.0 * correct / len(data_loader.dataset)

    return val_loss, correct, len(data_loader.dataset), accuracy

Can you try with some weight_decay? maybe weight_decay=5e-4?
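For reference, a minimal sketch of what that suggestion would look like. The `nn.Sequential` model here is only a stand-in with the same layer sizes as the `Net` in the question, not the original class.

```python
import torch
import torch.nn as nn

# Stand-in model (assumption): same 8 -> 4 -> 2 layout as the Net in the question.
model_net = nn.Sequential(nn.Linear(8, 4), nn.Tanh(), nn.Linear(4, 2))

# weight_decay adds an L2 penalty on the parameters to each SGD update.
optimizer = torch.optim.SGD(
    model_net.parameters(), lr=0.0001, momentum=0.9, weight_decay=5e-4
)
```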

@InnovArul Thanks for your reply.
I applied it, but it did not work.

I am not sure what task you are working on, but the lines above do not seem to be needed.
Can you clarify why you need those?

Sure. is a typo error it would be I rewrote them.

I want to send grad to other virtual machines to update the models there. Moreover, with loss.backward(retain_graph=True) the model would be updated with respect to all leaf nodes (data and weights).

The following part is only used when I want to train on part of the data rather than all of it. In the model() function, however, I want to pass all the data, so it is not needed.

if iteration_count >= max_nr_batches >= 0:

You do not need inside training loop.
Push the model to GPU before creating the optimizer.
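A rough sketch of that ordering, assuming a `device` variable and a stand-in model (both are illustrative names, not taken from the original code). It falls back to the CPU when no GPU is available.

```python
import torch
import torch.nn as nn

# `device` is an assumed name; falls back to CPU when CUDA is unavailable.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in model (assumption): same 8 -> 4 -> 2 layout as the Net in the question.
model_net = nn.Sequential(nn.Linear(8, 4), nn.Tanh(), nn.Linear(4, 2))

# Move the parameters to the target device first ...
model_net.to(device)

# ... then build the optimizer from the parameters that now live on that device.
optimizer = torch.optim.SGD(model_net.parameters(), lr=0.0001, momentum=0.9)

# Synthetic batch with the shapes from the question (96 samples, 8 features).
data = torch.randn(96, 8).to(device)
label = torch.randint(0, 2, (96,)).to(device)
output = model_net(data)
```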

I am not sure what you mean by this. Even without retain_graph=True, the model update will do what you mentioned. No?

It seems like you have a larger distributed training mechanism. Maybe if you try simple one-node training to overfit the model with small batches of data, you might be able to find the cause.
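Something along these lines could serve as the single-node sanity check. It is only a sketch under several assumptions: the data is synthetic, the loss is `nn.CrossEntropyLoss` on raw logits (the question does not show `loss_fn`), and the learning rate of 0.05 is picked for the toy problem. The point is that a tiny fixed batch should be easy to overfit once `zero_grad()`, `backward()` and `step()` are all called.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny synthetic batch (assumption): 16 samples, 8 features, label from the sign of feature 0.
data = torch.randn(16, 8)
label = (data[:, 0] > 0).long()

# Stand-in model (assumption): 8 -> 4 -> 2, emitting raw logits for CrossEntropyLoss.
model_net = nn.Sequential(nn.Linear(8, 4), nn.Tanh(), nn.Linear(4, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model_net.parameters(), lr=0.05, momentum=0.9)

losses = []
for step in range(200):
    optimizer.zero_grad()           # clear gradients from the previous step
    loss = loss_fn(model_net(data), label)
    loss.backward()                 # compute gradients for the weights
    optimizer.step()                # apply the update to the weights
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

If the loss does not drop on a batch this small, the problem is more likely in the training step itself than in the distributed setup.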

If I call .backward(), I get the gradient for all the leaves (the input as well as all the weights in the net), and then I can optimize the net with respect to all of them.
Moreover, I need the gradient w.r.t. the input data, which I get from data.grad.

I guess if I call backward twice, I should use retain_graph=True the first time.
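To make the two options concrete, here is a small sketch with a stand-in model and an assumed `nn.CrossEntropyLoss` (the question does not show `loss_fn`). A single `backward()` call fills `.grad` on every leaf that requires grad, including the input; `torch.autograd.grad` returns only the gradients that were asked for. `retain_graph=True` is used here only because the same graph is traversed a second time.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model and loss (assumptions): 8 -> 4 -> 2 with CrossEntropyLoss.
model_net = nn.Sequential(nn.Linear(8, 4), nn.Tanh(), nn.Linear(4, 2))
loss_fn = nn.CrossEntropyLoss()

# The input is a leaf tensor that requires grad, as in the question.
data = torch.randn(96, 8, requires_grad=True)
label = torch.randint(0, 2, (96,))

loss = loss_fn(model_net(data), label)

# Option A: backward() populates .grad on the input and on the weights.
# retain_graph=True keeps the graph alive for the second pass below.
loss.backward(retain_graph=True)
input_grad = data.grad.clone()
weight_grad = model_net[0].weight.grad.clone()

# Option B: autograd.grad returns a tuple with only the requested gradient
# and leaves the .grad attributes untouched.
(input_grad_b,) = torch.autograd.grad(outputs=loss, inputs=data)
```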

I will do it, Thanks for your suggestions.

What do you mean by not converging? That the loss is not decreasing? If so, the most likely culprit is a bad learning rate.

lr seems pretty low for SGD. How many epochs are you training this for?

Thanks for your reply.
I tried different values of lr, but nothing changed.

Thanks for your answer,
I tried with 10 epochs.
Actually, I am using data that I did not prepare myself. I only know that one-hot encoding and normalization were applied to it. Could the one-hot encoding be causing this?