Code Review - Sentiment Analysis [Beginner]

Hello everyone,
I am new to the ML domain and am attempting sentiment analysis on the IMDB dataset. I have used Embedding + Padding + LSTM + Unpacking + 3 Linear Layers → Output of shape (batch_size, classes), where the classes in my case are 0 and 1. I am using cross-entropy loss and Adam as the optimizer.
Observations:
1) On average it takes about 6 minutes to run an epoch.
2) The loss does not drop significantly.

Please find the model and the training function below for reference.

class SentiClassify_Model(nn.Module):
  def __init__(self,vocabLen,dims,hidden_size,seqLengths,output_size=2):
    super().__init__()
    #output_size =  2
    self.hidden_size = hidden_size
    self.seqLengths = seqLengths
    self.embed = nn.Embedding(vocabLen,dims)
    self.lstm_cell = nn.LSTM(input_size=dims,hidden_size=hidden_size)
    self.lf = nn.Linear(max(seqLengths)*hidden_size,(max(seqLengths)*hidden_size)//2)
    self.lf1 = nn.Linear((max(seqLengths)*hidden_size)//2,64)
    self.lf2 = nn.Linear(64,output_size)
    #self.sf_max = nn.Softmax(dim=1)

  def forward(self,input,hidden,verbose=False):
    embeds = self.embed(input)
    packedSeq = pack_padded_sequence(embeds.permute(1,0,2),self.seqLengths,batch_first=True,enforce_sorted=False)
    output,hidden = self.lstm_cell(packedSeq,hidden)
    outputt, input_sizes = pad_packed_sequence(output, batch_first=True)
    reshaped_out = outputt.reshape(outputt.size()[0],outputt.size()[1]*outputt.size()[2])
    lin_output = self.lf(reshaped_out)
    lin_output1 = self.lf1(lin_output)
    lin_output2 = self.lf2(lin_output1)
    return lin_output2

    if verbose:
      print("input shape",input.shape)
      print('embed shape',embeds.shape)
      print("Rehaped output",reshaped_out.size())
      print("After Fully conn layer :",lin_output.size())

  def init_hiddenlayer(self,batch_size,device='cpu'):
    return (torch.zeros(1,batch_size,self.hidden_size,device=device),torch.zeros(1,batch_size,self.hidden_size,device=device))

##########################################################################
def _trainLoader(model=None,dataloader=None,vocabList = None,optimm =None,loss_func=None,epochs=1,device='cpu'):
  ## Hyperparameters ##
  dims = 10
  hidden_size = 20
  loss_per_epoch,train_accuracy,test_accuracy = 0,None,None
  ## Batch Optimization ##
  for i in range(epochs):
    cummLoss = 0
    for ind,data in enumerate(dataloader):
      wordInput,seqLengths,targets = data["Vocab"].permute(1,0),data["Seqlen"],data["Senti"]
      modObj = model(len(vocabList),dims,hidden_size,seqLengths).to(device)
      hidden = modObj.init_hiddenlayer(wordInput.size()[-1],device=device)
      source = modObj(wordInput,hidden)
      if optimm==None:
        optimm = optim.Adam(modObj.parameters(),lr=0.005)
      loss = lossFn(source,targets)
      loss.backward(retain_graph=True)
      optimm.zero_grad()
      optimm.step()
      cummLoss+=loss.item()*source.size()[0] ## Cumulative loss per batch
    train_accuracy = computeAccuracy(targets,source)
    loss_per_epoch = cummLoss/5000
    print("Loss per Epoch : {} , Training Accuracy : {}".format(loss_per_epoch,train_accuracy))

Kindly review and let me know how I can improve the model and the training.
Thanks in advance!

It seems you are not applying any activation function between the linear layers, so these layers could effectively be collapsed into a single one.
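
For illustration, here is a tiny sketch (generic layer sizes, not your model) of the difference:

import torch.nn as nn

# Two stacked linear layers with nothing in between are equivalent to one affine map.
collapsible = nn.Sequential(nn.Linear(128, 64), nn.Linear(64, 2))

# A non-linearity in between is what gives the extra layer its expressive power.
non_linear = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
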
The bigger issue is that you are zeroing out the gradients after calculating them in the backward operation, so that the optimizer won’t update the parameters using these gradients:

  loss = lossFn(source,targets)
  loss.backward(retain_graph=True)
  optimm.zero_grad()
  optimm.step()

Move the zero_grad() operation either to the beginning or end of your training loop and don’t use it between the backward() and step() operations.
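
As a minimal sketch with generic names (model, optimizer, loss_fn and dataloader are placeholders, not your exact variables), the usual ordering inside the batch loop is:

for inputs, targets in dataloader:
    optimizer.zero_grad()        # clear gradients left over from the previous batch
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()              # compute fresh gradients
    optimizer.step()             # update the parameters using those gradients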

Thank you for your valuable inputs. I have understood your explanation.

Since you say you're a beginner, I would start with a simpler model. You currently use the hidden states from all time steps. While you can do this in principle, you might run into problems when using packing, since packing updates the hidden states only up to the length of each sequence.

For example, if the longest sequence in a batch has length 20 and a sequence S in the batch has length 15, the last 5 hidden states of S won't be meaningful, because the LSTM stopped for S at time step 15. You still feed all 20 time steps into the next linear layer.
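
To make this concrete, here is a small self-contained toy example (not your data) showing that the unpacked outputs beyond a sequence's true length are just padding, while the returned hidden state holds the last valid step:

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True)
batch = torch.randn(2, 20, 4)        # two sequences, padded to length 20
lengths = torch.tensor([20, 15])     # the second sequence really ends at step 15

packed = pack_padded_sequence(batch, lengths, batch_first=True, enforce_sorted=False)
packed_out, (h_n, c_n) = lstm(packed)
out, _ = pad_packed_sequence(packed_out, batch_first=True)

print(out[1, 15:])    # zeros: no real hidden states past step 15 for sequence 2
print(out[1, 14])     # last valid hidden state of sequence 2 ...
print(h_n[0, 1])      # ... which is exactly what h_n holds for it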

The safer bet – particularly in the beginning – is to use hidden after

output,hidden = self.lstm_cell(packedSeq,hidden)

hidden contains the LAST hidden state for each sequence. Its shape is (num_layers * num_directions, batch, hidden_size). Since you define your LSTM layer with num_layers=1 and bidirectional=False (default values), you can simply do

hidden = hidden[-1]

to get the final tensor of shape (batch, hidden_size), i.e., the last hidden state for each sequence. Of course, you then need to change your linear layers accordingly, e.g., self.lf = nn.Linear(hidden_size, hidden_size//2), etc.
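
As a rough sketch (reusing your attribute names, so double-check the shapes; note that nn.LSTM actually returns output, (h_n, c_n), so the indexing goes on h_n), the forward pass could then look like:

def forward(self, input, hidden=None):
    # assumes batch-first input and that self.lf is now nn.Linear(hidden_size, output_size)
    embeds = self.embed(input)
    packed = pack_padded_sequence(embeds, self.seqLengths, batch_first=True, enforce_sorted=False)
    packed_out, (h_n, c_n) = self.lstm_cell(packed, hidden)
    last_hidden = h_n[-1]        # (batch, hidden_size): last layer, last valid time step of each sequence
    return self.lf(last_hidden)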

Lastly, be very careful with reshape() or view():

reshaped_out = outputt.reshape(outputt.size()[0],outputt.size()[1]*outputt.size()[2])

If you're not careful, it can quickly mess up your data. You might want to check this post.
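
As a quick toy illustration of how view()/reshape() can silently scramble data while permute() keeps it intact:

import torch

x = torch.arange(6).reshape(2, 3)   # rows: [0, 1, 2] and [3, 4, 5]
print(x.view(3, 2))                 # reinterprets memory: [[0, 1], [2, 3], [4, 5]] -- rows get mixed
print(x.permute(1, 0))              # true transpose:      [[0, 3], [1, 4], [2, 5]]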

@vdw: Hi Chris, thanks for your valuable inputs. I have understood your explanation regarding using the final hidden state instead of all the hidden states (output).

Since the last few time steps will mostly contain padding vectors, you have suggested taking the final hidden state rather than the hidden state at every time step. I hope my understanding is correct?

Yes, I have taken care of the view part. I am using view only in the final step, to reduce the tensor from (batch_size x 1 (no. of layers) x hidden_size) → (batch_size, no. of layers * hidden_size). Could I have used squeeze instead of view to reduce the dimension of the tensor?
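
For example (toy shapes just to illustrate my question):

import torch

h = torch.randn(1, 8, 20)                 # (num_layers=1, batch_size, hidden_size)
a = h.permute(1, 0, 2).reshape(8, 20)     # what I do now via permute + view
b = h.squeeze(0)                          # would this give the same result?
print(torch.equal(a, b))                  # True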

As suggested, I have simplified the model: Embedding + Padding + LSTM → hidden[-1] → 1 linear layer.

I tried applying a non-linear activation (ReLU) after the linear layer, and also tried applying it to the hidden state of the LSTM, but it doesn't seem to have any effect.

I would like to ask one more thing: in my current training function, I am re-initializing the hidden state for every batch. I felt I was wrong in initializing the hidden state each time, but when I try to carry over the learnt hidden state instead, Colab crashes after 4 or 5 epochs due to high memory usage.
Could you kindly give your inputs in this regard too?

Below is the updated model class; you may find many unused arguments, kindly excuse them. Please do give your comments.

class SentiClassify_Model(nn.Module):
  def __init__(self,vocabLen,dims,hidden_size,seqLengths,batchSize,output_size=2):
    super().__init__()
    #output_size =  2
    self.hidden_size = hidden_size
    self.batchSize = batchSize
    self.seqLengths = seqLengths
    self.embed = nn.Embedding(vocabLen,dims)
    self.lstm_cell = nn.LSTM(input_size=dims,hidden_size=hidden_size)
    self.lf = nn.Linear(self.hidden_size,output_size)
    self.F = nn.ReLU(inplace=False)

  def forward(self,input,hidden=None,verbose=False):
    embeds = self.embed(input)
    packedSeq = pack_padded_sequence(embeds.permute(1,0,2),self.seqLengths,batch_first=True,enforce_sorted=False)
    output,hidden = self.lstm_cell(packedSeq,hidden)
    linear = self.lf(hidden[-1].permute(1,0,2)).view(self.batchSize,-1)
    print(linear.size())
    return linear

    if verbose:
      print("input shape",input.shape)
      print('embed shape',embeds.shape)
      print("Rehaped output",reshaped_out.size())
      print("After Fully conn layer :",lin_output.size())

  def init_hiddenlayer(self,hiddenLayers=1,batch_size=None,device='cpu'):
    return (torch.zeros(1*hiddenLayers,batch_size,self.hidden_size,device=device),torch.zeros(1*hiddenLayers,batch_size,self.hidden_size,device=device))

Yes, hidden contains the last hidden state w.r.t. the last time step, even if you use packing where you have sequences of different lengths. So if one sequence is of length 20 and the other of length 15, hidden will contain the 20th and 15th hidden states, since both are the last hidden states for these two sequences.

The challenge is not only to get the shapes right, but also to ensure that the data does not get scrambled up in the process :). I always consult the documentation to be sure.

The heavy lifting in terms of learning is probably done in the LSTM layer, so having 1, 2 or 3 additional linear layers might not be that important. This also means that you wouldn't see a big effect from removing/including a non-linearity. However, I would generally always include one between 2 linear layers.

Re-initializing the hidden state after/before each batch is the standard approach. It is also in a sense meaningful if your batches are independent: say you have independent short texts such as tweets, why should one tweet start with a different initial hidden state than another tweet? Of course, if you build your batches so that the sequences are dependent – e.g., for a language model where you use a sliding window over a long text to create your (sequence, target) pairs – then yes, preserving the hidden state across batches makes sense.

Colab throws a memory error because the computational graph for hidden will only get larger, so backpropagation will take longer and longer. If you want to preserve the hidden state between batches, you can use hidden.detach() (I’m not sure about the exact syntax). It detaches the hidden state from the computational graph. So it’s kind of a new hidden state but with the same values as the last one.
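
Something along these lines (a rough sketch with generic names; note that your forward() would then also need to return the new hidden state, and since an LSTM's hidden state is the tuple (h, c), both tensors need detaching):

h, c = model.init_hiddenlayer(batch_size=batch_size, device=device)
for data in dataloader:
    h, c = h.detach(), c.detach()                    # keep the values, drop the old computational graph
    optimizer.zero_grad()
    output, (h, c) = model(data["Vocab"], (h, c))    # assumes forward() returns the new hidden state
    loss = loss_fn(output, data["Senti"])
    loss.backward()
    optimizer.step()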

Hello Chris, thanks for your detailed reply. I am still not able to reduce the loss across epochs. The accuracy starts at 0.5 and keeps dropping and peaking in the range [0.48, 0.54].

This is the trend of loss & accuracy I get after multiple iterations of corrections and playing with hyperparameters. (Kindly ignore the mean test accuracy; it is not printing any value.)

[screenshot: loss and accuracy per epoch]

It would be of great help if you could review my entire code and give your inputs on how I could Improve my model.

Below is the entire code

# -*- coding: utf-8 -*-
"""Sentiment_Analysis_Sequence_Classification.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1av8Lcu21Eg6CKm26SCTINm6cDkUhvXW2

Sentiment Analysis is a Sequence Classification Problem. Here The labels are Positive & Negative.

Data Set : https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?select=IMDB+Dataset.csv

https://gist.github.com/HarshTrivedi/f4e7293e941b17d19058f6fb90ab0fec
"""

import nltk
from nltk.corpus import stopwords
import pandas as pd
import regex as re
from sklearn.model_selection import train_test_split
import plotly.express as px
from nltk.stem.wordnet import WordNetLemmatizer
from pprint import pprint
from collections  import Counter
import torch
import torch.nn as nn
from torch.utils.data import Dataset,DataLoader
from torch.nn.utils.rnn import pack_padded_sequence,pad_packed_sequence
import torch.optim as optim
from sklearn.metrics import accuracy_score

# Commented out IPython magic to ensure Python compatibility.
nltk.download(["stopwords","wordnet"])
# %cd /root/nltk_data/corpora/stopwords
stop_Words = stopwords.words("english")

from google.colab import drive
drive.mount('/gdrive')

# Commented out IPython magic to ensure Python compatibility.
# %cd /gdrive/MyDrive/IMDB_Senti_Analysis
!ls

isCuda = torch.cuda.is_available()
if isCuda:
  Mydevice = torch.device("cuda")
else:
  Mydevice = torch.device("cpu")

main_df = pd.read_csv('IMDB Dataset.csv')

main_df.head()

"""# Split Data"""

## Converting Positive ->1 and negative -> 0
main_df.sentiment[main_df.sentiment=="positive"]=1
main_df.sentiment[main_df.sentiment=="negative"]=0

main_df.head()

main_df["review"][1]

fig = px.bar(main_df,x=["Positive Review","Negative Review"],y = main_df["sentiment"].value_counts(),)
fig.show()

X,Y = main_df["review"].values,main_df["sentiment"].values ## Converting pd.series -> np array
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.9,stratify=Y)
print(X_train.shape,X_test.shape)

"""
---
# Cleaning Data - Tokenization
***

"""

def _string_cleanUp(arrOf_strs):
  count=0
  listOf_Strs = []
  for e_str in arrOf_strs:
    e_str = e_str.lower()   ## Lower casing the entire string
    e_str = re.sub(r'<[^>]*>','',e_str) ## Removing HTML Tags
    e_str = re.sub(r"[^\w\s]", ' ', e_str) ## Remove Special Characters 
    e_str = re.sub(r"\d", '', e_str) ## Remove Numbers from string
    count+=1
    listOf_Strs.append(e_str)
  return listOf_Strs

Cleaned_Sentences = _string_cleanUp(X_train)
for e_line in Cleaned_Sentences[0:5]:
  print(e_line)
  print("\n")

def _token_StringList(StrList,lemObj):
  
  wordList,spl_strs  = [],["<sos>","<eos>","<pad>"]
  for eLine in StrList:
    eLine = eLine.split(" ")
    for eWord in eLine:
      if eWord in stop_Words:continue ## Skipping stop words
      else:
        if  eWord == '':continue
        eWord = lemObj.lemmatize(eWord)
        wordList.append(eWord)
  return wordList

wl = WordNetLemmatizer()
wordToken = _token_StringList(Cleaned_Sentences,wl)

#wordToken = {ind:word for ind,word in enumerate(spl_strs+wordList)}

wordDict = Counter(wordToken)
print(wordDict)

def _return_most_recurringVocab(worDict):
  spl_strs = ["<pad>"]
  vList = [x[0] for x in sorted(worDict.items(),key=lambda x:x[1],reverse=True)[:1000]]
  return {word:ind for ind,word in enumerate(spl_strs+vList)}

"""# Train & Test Data(Indexed Vocab)"""

trainVocab = _return_most_recurringVocab(wordDict)
print(trainVocab)

## Similar activity for Test Data ##
test_Cleaned_Sentences =  _string_cleanUp(X_test)
testWordToken = _token_StringList(test_Cleaned_Sentences,wl)

testVocab = _return_most_recurringVocab(Counter(testWordToken))
print(testVocab)

"""# Custom Data Loader """

## Create a custom dataset loader ## 
class _reviews_loader(Dataset):
  def __init__(self,X,Y):
    super().__init__()
    self.X,self.Y = X,Y
    
  
  def __len__(self):
    #d_frame = pd.read_csv(csv_file_name)
    return len(self.X)
  
  def __getitem__(self,idx):
    returnDict = (self.X[idx],self.Y[idx])
    return returnDict

class MyCollateClass():
  def __init__(self,vocabDict = None):
    self.vocabDict = vocabDict

  def _string_cleanUp(arrOf_strs):
    count=0
    listOf_Strs = []
    for e_str in arrOf_strs:
      e_str = e_str.lower()   ## Lower casing the entire string
      e_str = re.sub(r'<[^>]*>','',e_str) ## Removing HTML Tags
      e_str = re.sub(r"[^\w\s]", ' ', e_str) ## Remove Special Characters 
      e_str = re.sub(r"\d", '', e_str) ## Remove Numbers from string
      count+=1
      listOf_Strs.append(e_str)
    return listOf_Strs

  def _return_indexList(self,OneSentance):
    vocabIndexes = []
    for eWord in OneSentance.split(" "):
      if eWord in list(self.vocabDict.keys()):
        vocabIndexes.append(self.vocabDict[eWord])
    idx_Tensor = torch.LongTensor(vocabIndexes)
    return idx_Tensor
  
  def _stack_Sentance_info(self,max_sentence_len = None,batch_size=None,device='cpu'):
    tensorList,updatedTensorList,seqLengths = [],[],[]
    for ind,eLine in enumerate(self.cleanedList):
      retTensor = self._return_indexList(eLine)
      tensorList.append(retTensor)
    maxTensorSize = max(list((e_Tensor.size()[0] for e_Tensor in tensorList)))
    for e_tensor in tensorList:
      seqLengths.append(e_tensor.size()[0])
      if e_tensor.size()[0]<maxTensorSize:
        diff = maxTensorSize - e_tensor.size()[0]
        newTensor = torch.cat([e_tensor,torch.zeros(diff)])
        updatedTensorList.append(newTensor)
      else:updatedTensorList.append(e_tensor)
    finalTensor = torch.stack(updatedTensorList).type(torch.LongTensor).to(device)
    return finalTensor,seqLengths

  def PadCollate(self,batch):
    def _get_max_sentance_len(SentanceList):
      return max(list((len(esentance.split(' ')) for esentance in SentanceList)))
    def _convert_senti_to_int(SentList,device='cpu'):
      sTensor = torch.LongTensor(SentList)
      return sTensor
    batch_Dict = {}
    revList = list((eTuple[0] for eTuple in batch))
    sentiList = list((eTuple[1] for eTuple in batch))
    stacked_senti_tensor = _convert_senti_to_int(sentiList,device=Mydevice).to(Mydevice)
    self.cleanedList = _string_cleanUp(revList)
    maxLen_sentance = _get_max_sentance_len(self.cleanedList)
    stacked_vocab_tensor,seqLengths = self._stack_Sentance_info(maxLen_sentance,len(batch),device=Mydevice)
    batch_Dict = {"Vocab":stacked_vocab_tensor,"Senti":stacked_senti_tensor,"Seqlen":seqLengths}
    return batch_Dict


  def __call__(self,batch):
    return self.PadCollate(batch)

review_dataset = _reviews_loader(X_train,Y_train)
dataloader1 = DataLoader(review_dataset,batch_size = 10,shuffle=True, num_workers=0,collate_fn=MyCollateClass(trainVocab))

for ind,data in enumerate(dataloader1):
  if ind>3:break
  print(data["Vocab"].device)
  print(data["Senti"])
  print(data["Vocab"].shape)
  print(data["Senti"].shape)
  print("seq lenght",data["Seqlen"])
  print('*'*75)

"""MODEL
---
"""

class SentiClassify_Model(nn.Module):
  def __init__(self,vocabLen,dims,hidden_size,seqLengths,batchSize,numLayers,output_size=2):
    super().__init__()
    #output_size =  2
    self.hidden_size = hidden_size
    self.batchSize = batchSize
    self.numLayers = numLayers
    self.seqLengths = seqLengths
    self.embed = nn.Embedding(vocabLen,dims)
    self.lstm_cell = nn.LSTM(input_size=dims,hidden_size=hidden_size,batch_first =True,num_layers=self.numLayers)
    self.lf = nn.Linear(self.hidden_size*self.numLayers,output_size)
    self.F = nn.ReLU(inplace=False)
    
    

  def forward(self,input,hidden=None,verbose=False):
    embeds = self.embed(input).permute(1,0,2)
    output,(hidden,cell) = self.lstm_cell(embeds,hidden)
    hidden.permute(1,0,2)
    linear = self.lf(hidden.view(self.batchSize,-1))
    return linear

    if verbose:
      print("input shape",input.shape)
      print('embed shape',embeds.shape)
      print("Rehaped output",reshaped_out.size())
      print("After Fully conn layer :",lin_output.size())

  def init_hiddenlayer(self,num_layers =1,hiddenLayers=1,batch_size=None,device='cpu'):
    return (torch.zeros(num_layers*hiddenLayers,batch_size,self.hidden_size,device=device),torch.zeros(num_layers*hiddenLayers,batch_size,self.hidden_size,device=device))

def computeAccuracy(target,source):
  sf_max_obj = nn.Softmax(dim=1)
  sf_max = sf_max_obj(source)
  sf_max = torch.argmax(sf_max,dim=1)
  fintensor = torch.where(sf_max==1,1,0) ## 1-> positive ,0->Negative
  score = accuracy_score(target.tolist(),fintensor.tolist())
  return score


def infer(dataLoader,net,device):
  net.eval().to(device)
  allScores = []
  for ind,data in enumerate(dataLoader):
    wordInput,seqLengths,targets = data["Vocab"].permute(1,0),data["Seqlen"],data["Senti"]
    source = net(wordInput)
    allScores.append(computeAccuracy(targets,source))
  return sum(allScores)/len(allScores)

def _trainLoader(model=None,Train_dataset=None,Test_Loader = None, batchSize =None,vocabList = None,optimm =None,loss_func=None,epochs=1,device='cpu',lr=0.005):
  ## Hyperparameters ##
  dims = 20
  hidden_size = 25
  num_LSTMLayers = 2

  maxLoss = 10000
  dataloader1 = DataLoader(Train_dataset,batch_size = batchSize,shuffle=True, num_workers=0,collate_fn=MyCollateClass(vocabList))
  loss_per_epoch,train_accuracy,test_accuracy = 0,None,None
  ## Batch Optimization ##
  hidden = None
  for i in range(epochs):
    cummLoss = 0
    for ind,data in enumerate(dataloader1):
      wordInput,seqLengths,targets = data["Vocab"].permute(1,0),data["Seqlen"],data["Senti"]
      modObj = model(len(vocabList),dims,hidden_size,seqLengths,batchSize,num_LSTMLayers).to(device)
      hidden = modObj.init_hiddenlayer(num_layers=num_LSTMLayers, hiddenLayers = 1,batch_size = batchSize,device=device)
      #print("ind {}/{}".format(i,hidden[0].size(),hidden[1].size()))
      if wordInput.size()[-1]!=batchSize:continue
      if optimm==None:
        print("Initializing Optimizer")
        optimm = optim.Adam(modObj.parameters(),lr=lr)
      optimm.zero_grad()
      source = modObj(wordInput,hidden)
      loss = lossFn(source,targets)
      loss.backward()
      optimm.step()
      cummLoss+=loss.item()*batchSize ## Cumulative loss per batch
      train_accuracy = computeAccuracy(targets,source)

    loss_per_epoch = cummLoss/batchSize
    train_accuracy = computeAccuracy(targets,source)
    if loss_per_epoch<maxLoss:
      maxLoss = loss_per_epoch
      torch.save({
          'epoch': i,
          'model_state_dict': modObj.state_dict(),
          'optimizer_state_dict': optimm.state_dict(),
          'loss': loss_per_epoch,
          }, "model.pt")
    Mean_testAccuracy=0
    if Test_Loader!=None:
      Mean_testAccuracy = infer(testLoader,modObj,Mydevice)

    
    print("Loss per Epoch : {} , Training Accuracy : {}, Mean Test Accuracy".format(loss_per_epoch,train_accuracy,Mean_testAccuracy))

lossFn = nn.CrossEntropyLoss()
review_dataset = _reviews_loader(X_train,Y_train)
test_data =  _reviews_loader(X_test,Y_test)
testLoader = DataLoader(test_data,batch_size = 500,shuffle=True, num_workers=0,collate_fn=MyCollateClass(testVocab))
_trainLoader(model=SentiClassify_Model,Train_dataset=review_dataset,Test_Loader = None,batchSize=500, vocabList = trainVocab,loss_func=lossFn,epochs=20,device=Mydevice)

sample_tens = torch.tensor([[0.45,0.5],
                            [0.5,0.48]])
print(sample_tens)
bb = torch.argmax(sample_tens,dim=1)
print(bb)
bb = torch.where(bb==1,1,0)
print(bb)

@vdw Please ignore the last 4 or 5 lines of the code. It's a test cell where I was trying something out.

Hm, two things stand out to me.

Firstly, it seems you create your model modObj for every batch. That re-creates the model every time and you lose all your changes to the weights. This should be done only once, so I would move this line, say, before dataloader1 = ....

Secondly, I'm still a bit suspicious about the forward() method. Maybe all is correct, but I can't test it. It's a bit strange that you permute the embeddings. Usually the input and therefore the embeddings have batch_size as the first dimension. You also define your nn.LSTM with batch_first=True, so all should be good without permuting. However, since your code runs, I assume it must be correct.

However, the processing of hidden seems a bit off. hidden first has a shape of (num_layers * num_directions, batch, hidden_size). After permuting it's (batch, num_layers * num_directions, hidden_size), and after the .view() it's (batch, num_layers * num_directions * hidden_size) – while people usually use only the last layer, this shouldn't be a problem in principle. But again, I'm not sure that the data doesn't get messed up, similar to the example I've already linked to.

To get a first basic version running, I would do the following:

  def forward(self,input,hidden=None,verbose=False):
    embeds = self.embed(input).permute(1,0,2)  # <-- double-check this :)
    output,(hidden,cell) = self.lstm_cell(embeds,hidden)
    # Split layers and directions (useful if you want to try bidirectional=True later on)
    hidden = hidden.view(self.num_layers, self.num_directions, hidden.size(1), self.hidden_size)
    # Get the last hidden state with respect to the layers
    hidden = hidden[-1]
    # Get rid of the direction dimension (won't work for bidirectional=True)
    hidden = hidden.squeeze(0)
    # the shape of hidden is now (batch, hidden_size)
    # so self.lf needs to be nn.Linear(self.hidden_size,output_size)
    linear = self.lf(hidden)
    return linear

Here's a complete example of a GRU/LSTM-based text classifier. The important part is the forward() method and the handling of the hidden state. This model solves exactly your task and even comes with attention :)

@vdw Hello Chris, your suggestions worked very well. Thank you. When I enabled bidirectional learning, the model improved further.

Somehow, when no dropout was added during the initial training phase, the model was over-fitting and the test accuracy increased only marginally.

When I added a dropout of p=0.3, the test accuracy improved and the model seems to learn now. I got a test accuracy of 0.6 and a training accuracy of 0.88 after 10 epochs with lr=0.005.
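
Roughly, the bidirectional part of the model now looks like this (a simplified sketch without packing; the names are illustrative, not my exact notebook code):

import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_len, dims, hidden_size, num_layers=2, output_size=2, p=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab_len, dims)
        self.lstm = nn.LSTM(dims, hidden_size, num_layers=num_layers,
                            batch_first=True, bidirectional=True, dropout=p)
        self.drop = nn.Dropout(p)
        self.lf = nn.Linear(2 * hidden_size, output_size)     # forward + backward states

    def forward(self, x):
        embeds = self.embed(x)                    # (batch, seq_len, dims)
        _, (h_n, _) = self.lstm(embeds)           # h_n: (num_layers * 2, batch, hidden_size)
        h_n = h_n.view(self.lstm.num_layers, 2, x.size(0), self.lstm.hidden_size)
        last = torch.cat([h_n[-1, 0], h_n[-1, 1]], dim=1)     # last layer, both directions
        return self.lf(self.drop(last))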