Yet another network-not-training post

I know this is a common topic and I have done my research, but I can’t figure out the issue so I’m asking for your help.

I have a simple one-hidden-layer network to predict two balanced classes (0 and 1). This is my setup (one epoch iterates over all the batch start offsets in trainBatches):

inputSize = 750
hiddenSize = 50
outputSize = 2 
batchSize = 256

model = torch.nn.Sequential(
    torch.nn.Linear(inputSize, hiddenSize),
    torch.nn.ReLU(),
    torch.nn.Linear(hiddenSize, outputSize),
)
lossFn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

model.train()
for k in trainBatches:

    # Variable is deprecated since PyTorch 0.4; plain tensors work directly
    input_ = torch.as_tensor(data[k:k+batchSize, :-1], dtype=torch.float32)
    target_ = torch.as_tensor(data[k:k+batchSize, -1], dtype=torch.float32)
    
    output_ = model(input_)

    loss = lossFn(output_, target_.long())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

My input matrix is sparse, but no sparser than in standard one-hot encoding setups. Target classes are balanced. The output is a pair of unnormalized logit columns:

In [9]: model(input_)
Out[9]: 
tensor([[ 0.2018, -0.2460],
        [ 0.2018, -0.2460],
        [ 0.2018, -0.2460],
        [ 0.2532, -0.3221],
        [ 0.2641, -0.3455],

And target also has the correct structure:

In [11]: target_.long()
Out[11]: 
tensor([0, 0, 0, 1, 1, 1...])

I’ve tried different reductions, learning rates, hidden sizes, and numbers of layers, and the accuracy always stays around 0.5, so there is clearly something wrong in the learning stage that is independent of the parameters/network structure.

This seems like such a simple example that it should train out of the box. What else can I look into? Are there any mistakes in my setup?

Thanks in advance.

Put zero_grad() after step()

Thanks for your response. If I do it the way you suggest, all my gradients die:

```
[tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]]), tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]), tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]), tensor([0., 0.])]
```

Original order:

```
[tensor([[-0.9561,  0.0000,  0.0000,  ...,  0.0000,  0.0000, -0.1120],
        [-0.0981,  0.0000,  0.0000,  ...,  0.2065,  0.0000, -0.0217],
        [-0.4990,  0.0000,  0.0000,  ..., -0.2006,  0.1060,  0.0247],
        ...,
        [ 0.1178,  0.0000,  0.0000,  ...,  0.0000,  0.0000, -0.0803],
        [-0.0503,  0.0000,  0.0000,  ...,  0.0000,  0.1357,  0.0162],
        [ 1.3231,  0.0000,  0.0000,  ...,  0.5383,  0.0000,  0.3681]]), tensor([-1.5053, -0.6233, -0.2615, -1.0045,  0.3973,  0.6638,  1.0489, -0.1003,
        -0.2944,  1.3231]), tensor([[ 0.8244, -0.5169, -1.6222, -1.2467,  2.6140,  1.1356, -0.7496, -0.6147,
          0.7515,  1.6823],
        [-0.8244,  0.5169,  1.6222,  1.2467, -2.6140, -1.1356,  0.7496,  0.6147,
         -0.7515, -1.6823]]), tensor([-3.6501,  3.6501])]
```

Regardless of depth and hidden size.

I’m sorry, I meant “move”: you shouldn’t do zero_grad() after forward until all processing is done.

No worries, I appreciate your time. Just to make sure I understand: is the order of my steps wrong? Right now I have

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

And I do this for every batch; is this not correct? I’m not sure what you mean by “shouldn’t do zero_grad() after forward until all processing is done.”

I believe so, yes.

Actually, no, your invocation order seems to be working too - I tried your code with toy inputs and it is learning fine.
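For reference, this is roughly the kind of toy check meant here (a sketch, not the original test: the synthetic, linearly separable data is my stand-in, and the sizes are copied from the original post):

```python
import torch

torch.manual_seed(0)
inputSize, hiddenSize, outputSize, batchSize = 750, 50, 2, 256

# Same architecture as in the original post
model = torch.nn.Sequential(
    torch.nn.Linear(inputSize, hiddenSize),
    torch.nn.ReLU(),
    torch.nn.Linear(hiddenSize, outputSize),
)
lossFn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Toy, linearly separable data: the class is the sign of the first feature
X = torch.randn(2048, inputSize)
y = (X[:, 0] > 0).long()

model.train()
for epoch in range(30):
    for k in range(0, len(X), batchSize):
        loss = lossFn(model(X[k:k + batchSize]), y[k:k + batchSize])
        optimizer.zero_grad()  # same zero_grad / backward / step order as the post
        loss.backward()
        optimizer.step()

accuracy = (model(X).argmax(dim=1) == y).float().mean().item()
print(accuracy)
```

If a loop with this exact ordering reaches high accuracy on separable data, the ordering itself is not the problem.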

You should check whether a plain linear model (softmax regression) can learn anything from your dataset; try this:

model = torch.nn.Sequential(
    torch.nn.Linear(inputSize, outputSize),
)
model[0].weight.data.zero_()  # same as model._modules['0'].weight.data.zero_()
model[0].bias.data.zero_()
lossFn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adadelta(model.parameters())  # Adadelta requires no lr tuning
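Wired into a full run, that baseline might look like this (a sketch: the training loop and the synthetic data are my stand-ins, not part of the original suggestion):

```python
import torch

torch.manual_seed(0)
inputSize, outputSize, batchSize = 750, 2, 256

model = torch.nn.Sequential(
    torch.nn.Linear(inputSize, outputSize),
)
model[0].weight.data.zero_()  # zero init: first updates driven purely by the data
model[0].bias.data.zero_()
lossFn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adadelta(model.parameters())  # adaptive per-parameter steps

# Stand-in data with a genuine linear signal in the first feature
X = torch.randn(2048, inputSize)
y = (X[:, 0] > 0).long()

for epoch in range(50):
    for k in range(0, len(X), batchSize):
        loss = lossFn(model(X[k:k + batchSize]), y[k:k + batchSize])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

accuracy = (model(X).argmax(dim=1) == y).float().mean().item()
print(loss.item(), accuracy)
```

If even this baseline cannot beat 0.5 on the real data, the problem is in the features or labels, not in the network.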

I ran the regression example you suggested and the model is still not learning. I also ran my code on a small sklearn dataset as a sanity check of my setup, and it does learn in that simple case, so it’s definitely not something silly like that. I can mostly reproduce the lack of learning by randomizing the target column in the small sklearn toy example, so I think I’m facing a deeper issue with my data generation process. Thanks again for your time.

Yeah, if regression doesn’t work at all, it means that either you have no useful predictors, or more powerful methods are needed to find them. Using sklearn classification algorithms should tell you which is the case.
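For instance, something along these lines (a sketch: make_classification is only a stand-in; substitute the real feature matrix and labels):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data; replace X, y with the real features and targets
X, y = make_classification(n_samples=2000, n_features=750,
                           n_informative=10, random_state=0)

# Linear baseline: is there a simple decision boundary at all?
linear_acc = cross_val_score(LogisticRegression(max_iter=2000), X, y, cv=3).mean()

# Non-linear baseline: is there signal a linear model cannot see?
forest_acc = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=3).mean()

print(linear_acc, forest_acc)
```

If both scores sit near 0.5 on the real data, the features likely carry no usable signal; if only the forest does well, the relationship is non-linear and a more powerful model (or better features) is needed.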

BTW, your batch size may be too big for stochastic gradient descent