Classification neural network using autograd

Hey guys, I want to make a simple classification neural network with PyTorch's autograd package. I have gone through some resources that helped me write the code, but the problem is that the code is not working. I have tried a few fixes, but none of them worked for me.
I am trying to classify the MNIST dataset with a simple four-layer network built from matrix multiplications, using sigmoid activations and a cross-entropy loss function. The network processes the data in batches. Could you please have a look at the code and tell me what I am doing wrong? The network class and training loop are given below, and a link to the Colab file is included as well. Thanks in advance! https://colab.research.google.com/drive/1V4mfk4Fbc7x4uCIehmSohEVEXwJAnlOn?usp=sharing

import torch
from random import randint

# X, Y and train_X are the MNIST inputs and labels loaded earlier in the notebook

class network():

    def __init__(self, num_layer, layer_size):
        # weights and biases for four linear layers, created as leaf tensors
        # so that autograd tracks their gradients
        self.W1 = torch.randn(layer_size[0], layer_size[1], dtype=torch.float32, requires_grad=True)
        self.b1 = torch.randn(layer_size[1], dtype=torch.float32, requires_grad=True)
        self.W2 = torch.randn(layer_size[1], layer_size[2], dtype=torch.float32, requires_grad=True)
        self.b2 = torch.randn(layer_size[2], dtype=torch.float32, requires_grad=True)
        self.W3 = torch.randn(layer_size[2], layer_size[3], dtype=torch.float32, requires_grad=True)
        self.b3 = torch.randn(layer_size[3], dtype=torch.float32, requires_grad=True)
        self.W4 = torch.randn(layer_size[3], layer_size[4], dtype=torch.float32, requires_grad=True)
        self.b4 = torch.randn(layer_size[4], dtype=torch.float32, requires_grad=True)

        self.act = torch.nn.ReLU()
        self.act1 = torch.nn.Sigmoid()
        self.act_last = torch.nn.Softmax(dim=1)
        self.loss1 = torch.nn.CrossEntropyLoss()

    def forward(self, x):
        # four linear layers, each followed by a sigmoid activation
        h1 = self.act1(torch.matmul(x, self.W1) + self.b1)
        h2 = self.act1(torch.matmul(h1, self.W2) + self.b2)
        h3 = self.act1(torch.matmul(h2, self.W3) + self.b3)
        return self.act1(torch.matmul(h3, self.W4) + self.b4)

    def loss(self, pred, Y):
        # cross-entropy, averaged over the batch of 20
        return -torch.sum(Y * torch.log(pred)) / 20

model = network(5, [784, 392, 196, 98, 10])
lr = 0.00001
n_iters = 100

for epoch in range(n_iters):
    for i in range(3000):
        # pick a random batch of 20 samples
        value = randint(0, train_X.shape[0] - 50)
        X1 = X[value:value + 20]
        Y1 = Y[value:value + 20]

        y_pred = model.forward(X1)
        l = model.loss(y_pred, Y1)
        l.backward()

        with torch.no_grad():
            model.W1 -= lr * model.W1.grad
            model.W2 -= lr * model.W2.grad
            model.W3 -= lr * model.W3.grad
            # value_W4=self.W4-lr*self.W4.grad
            model.b1 -= lr * model.b1.grad
            model.b2 -= lr * model.b2.grad
            model.b3 -= lr * model.b3.grad

        model.W1.grad.zero_()
        model.W2.grad.zero_()
        model.W3.grad.zero_()
        model.b1.grad.zero_()
        model.b2.grad.zero_()
        model.b3.grad.zero_()

    print(f'epoch {epoch+1}, loss={l:.8f}')

Hi Hamad!

First, I think going through an exercise like this is very worthwhile.
Taking the time to do “by hand” some of the things built in to pytorch
is a great way to really learn what is going on and will prove valuable
when you move on to tackling more complicated problems.

I have not gone through your code in detail, so there may be other
issues, but I do see a couple of things that you should look into.

As a minor note, you might want to remove or comment out things
such as self.act that are unused. Doing so could make your code
more readable, and perhaps less error-prone.

Applying a Sigmoid layer to your final linear layer is not correct for
a multi-class classification problem where you use cross-entropy as
your loss. Doing so will convert each of your ten outputs into an
independent probability (between 0.0 and 1.0), but you want a
probability distribution over the ten classes. That is, you want a set
of ten values that are not only individually between 0.0 and 1.0,
but that sum to 1.0 as well.

To get such a probability distribution you want your final activation
layer to be your self.act_last = torch.nn.Softmax(dim=1).
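For illustration, here is a minimal sketch (with made-up raw scores) of the
difference: sigmoid() gives ten independent probabilities per sample, while
softmax() gives ten values that form a single distribution:

import torch

scores = torch.randn(2, 10)            # hypothetical raw scores: 2 samples, 10 classes

sig = torch.sigmoid(scores)            # each entry lies in (0, 1), but rows need not sum to 1
soft = torch.softmax(scores, dim=1)    # each row is a probability distribution

print(sig.sum(dim=1))                  # generally not 1.0
print(soft.sum(dim=1))                 # tensor([1., 1.])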

(However, what you really want is not to have a final activation layer,
to work directly with the raw-score logits generated by your final linear,
and to write a “cross-entropy-with-logits” loss function that uses
log_softmax() internally.)
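As a sketch of what such a cross-entropy-with-logits loss could look like
(assuming the targets are one-hot or soft labels with the same shape as the
logits, as discussed below):

import torch

def cross_entropy_with_logits(logits, target):
    # target: one-hot (or soft) labels, same shape as logits
    log_probs = torch.log_softmax(logits, dim=1)
    return -(target * log_probs).sum(dim=1).mean()

With this approach your forward() would return torch.matmul(h3, self.W4) + self.b4
directly, with no activation at the end.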

You don’t say what Y is here, but, as written, Y should be one-hot
encoded class labels (or more generally, “soft labels” that are given
by a probability distribution over the classes – each in [0.0, 1.0],
and that sum to 1.0).
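If your Y currently stores integer class indices, a minimal sketch of turning
them into one-hot labels (one_hot() returns integers, so cast to float):

import torch

labels = torch.tensor([3, 0, 7])                                  # integer class indices
Y = torch.nn.functional.one_hot(labels, num_classes=10).float()   # shape (3, 10)
# each row contains a single 1.0 and nine 0.0s, so it sums to 1.0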

Some side comments:

As it currently stands, your network has what we call three “hidden”
layers. I would also encourage you to experiment with “shallower”
networks with only one or two hidden layers.

Your learning rate of 0.00001 strikes me as being smaller than it
needs to be. You might be able to train faster with a larger learning
rate (although it is sometimes necessary to start with a smaller rate
during a “warm-up” period and it can be helpful to reduce the rate
towards the end of your training to avoid jumping back and forth
across “gullies.”)
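Since you are updating the weights by hand (without a torch.optim optimizer),
a simple decay schedule could be sketched like this (the starting rate and
decay factor are just illustrative):

lr = 0.01               # illustrative starting rate
for epoch in range(n_iters):
    # ... run the inner training loop ...
    lr *= 0.99          # shrink the step size a little each epoch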

You will almost certainly need more than 100 epochs.

Now, the big one: You aren’t optimizing your final linear layer. This
will make it very hard for your network to train, as the upstream
layers will have to “learn” to undo the damage being done by your
randomly-initialized – and frozen – final layer.
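Concretely, the fix would be to also step W4 and b4 inside your no_grad()
block and zero their grads, along these lines:

with torch.no_grad():
    model.W4 -= lr * model.W4.grad
    model.b4 -= lr * model.b4.grad
model.W4.grad.zero_()
model.b4.grad.zero_()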

Good luck.

K. Frank

Thanks @KFrank, your suggestions helped a lot and I solved my problem. Not using Softmax was probably the key mistake in my code.