Word2Vec embeddings give strange gradients in the linear layers

Hello,

Sorry if I don't give all the info from the start; this is my first post on the forums.

I’ve been using Word2Vec from gensim as a feature extractor for text, and when training a PyTorch model on these embeddings the validation results fluctuate a lot from epoch to epoch. I think this is because the gradients are concentrated in the last layer and close to 0 in the other layers; I attached a plot of the per-layer gradients. What bothers me is that scikit-learn's MLPRegressor gives much more stable results on the same features, and when I replicate that MLP in PyTorch I get different results.
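
For context, the 200-dimensional features come out of something roughly like this (a simplified sketch: the names tokenized_texts, train_tokens, val_tokens and doc_vector are just illustrative, and the averaging into one vector per document is not my exact preprocessing):

import numpy as np
from gensim.models import Word2Vec

# tokenized_texts: list of token lists, e.g. [["the", "cat", "sat"], ...]
# vector_size is the gensim 4.x name; older gensim versions call it size
w2v = Word2Vec(sentences=tokenized_texts, vector_size=200, window=5, min_count=1, workers=4)

def doc_vector(tokens):
    # average the word vectors of the tokens that are in the vocabulary
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(200, dtype=np.float32)

wordvec_train = np.stack([doc_vector(t) for t in train_tokens])
wordvec_val = np.stack([doc_vector(t) for t in val_tokens])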

The model looks like this, trained with L1Loss and the Adam optimizer:

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # initial hidden/cell state, kept as attributes and updated in forward
        self.h0 = torch.randn(1, 1, 150).cuda()
        self.c0 = torch.randn(1, 1, 150).cuda()
        self.lstm = nn.LSTM(200, 150, 1)
        for name, param in self.lstm.named_parameters():
            if 'bias' in name:
                pass
            elif 'weight' in name:
                nn.init.xavier_normal_(param, gain=nn.init.calculate_gain('tanh'))
        self.fc1 = nn.Linear(150, 500)
        nn.init.xavier_normal_(self.fc1.weight, gain=nn.init.calculate_gain('tanh'))
        self.fc2 = nn.Linear(500, 300)
        nn.init.xavier_normal_(self.fc2.weight, gain=nn.init.calculate_gain('tanh'))
        self.fc3 = nn.Linear(300, 2)
        nn.init.xavier_normal_(self.fc3.weight, gain=nn.init.calculate_gain('tanh'))
        #self.bc1 = nn.BatchNorm1d(150)
        #self.bc3 = nn.LayerNorm(300)
        self.f = nn.Tanh()

    def forward(self, x):
        x = x.unsqueeze(1)
        # the updated hidden/cell state is stored back on the module
        x, (self.h0, self.c0) = self.lstm(x, (self.h0, self.c0))
        x = self.f(x)
        x = x.flatten(1)
        x = self.f(self.fc1(x))
        x = self.f(self.fc2(x))
        x = self.fc3(x)
        return x

[image: plot of the per-layer gradient flow]

I want to know why this happens and why I can't achieve better results than scikit-learn's MLPRegressor.

This could be due to the vanishing gradient problem. Can you try using just one fc layer instead of 3 (something like the sketch below)?
Also, please format your code so it's easier to read.
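
Roughly along these lines (just a sketch reusing your layer sizes; the class name NetOneFC is mine and I dropped the init code for brevity):

class NetOneFC(nn.Module):
    def __init__(self):
        super().__init__()
        self.h0 = torch.randn(1, 1, 150).cuda()
        self.c0 = torch.randn(1, 1, 150).cuda()
        self.lstm = nn.LSTM(200, 150, 1)
        self.fc = nn.Linear(150, 2)   # single fc head instead of 150->500->300->2
        self.f = nn.Tanh()

    def forward(self, x):
        x = x.unsqueeze(1)
        x, (self.h0, self.c0) = self.lstm(x, (self.h0, self.c0))
        x = self.f(x)
        x = x.flatten(1)
        return self.fc(x)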

Thank you for the response. I made the changes and kept only one fc layer; I attached the gradients, which look pretty much the same, and the validation results are worse. I also looked into vanishing gradients in LSTMs and didn't find anything that seemed to apply.
[image: gradient flow plot with a single fc layer]

I also tried increasing the batch size and doing a gradient update every 3 batches instead of every batch; the resulting accumulated gradients look like this:
[image: gradient flow plot with gradients accumulated over 3 batches]
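
The accumulation itself is roughly this change in the training loop (simplified sketch; dividing the loss by the number of accumulated batches is my own choice so the summed gradients match one big batch):

accumulation_steps = 3
optimizer.zero_grad()
for batch_idx, (data, target) in enumerate(train_loader):
    output = network(data.cuda())
    loss_value = loss(output, target.long().cuda()) / accumulation_steps
    loss_value.backward(retain_graph=True)
    if (batch_idx + 1) % accumulation_steps == 0:
        # step only every accumulation_steps batches, then clear the grads
        optimizer.step()
        optimizer.zero_grad()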

  1. Try using ReLU instead of Tanh for the activation (see the one-line change below the list).
  2. Can you share the rest of the code for training and validation?
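
For the first point, it's just a swap in your __init__ (the gain note is optional):

# in Net.__init__, replace the activation
self.f = nn.ReLU()   # instead of nn.Tanh()
# if you keep Xavier init, you may also want gain=nn.init.calculate_gain('relu')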

Sure, the loss is L1Loss and the optimizer is Adam; here are the training and validation functions:

from sklearn.metrics import mean_absolute_error

def test(loss):
    network.eval()
    with torch.no_grad():
        val = network(torch.Tensor(wordvec_val).cuda())
        loss_value = loss(val, torch.Tensor(test_labels).long().cuda())
        print(loss_value.item())
        val = val.cpu().detach().numpy()
        print(" ")
        print("Validation MAE : scikit-learn ", mean_absolute_error(val, test_labels))
        print(" ")
        test_val.append(mean_absolute_error(val, test_labels))

def train(epoch, loss):
    network.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = network(data.cuda())
        loss_value = loss(output, target.long().cuda())
        loss_value.backward(retain_graph=True)
        plot_grad_flow(network.named_parameters())
        optimizer.step()
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss_value.item()))
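
plot_grad_flow isn't shown above; it's roughly the usual helper from the forums that averages the absolute gradient per layer and plots it, something like:

import matplotlib.pyplot as plt

def plot_grad_flow(named_parameters):
    # average absolute gradient per weight tensor, skipping biases
    ave_grads, layers = [], []
    for name, param in named_parameters:
        if param.requires_grad and 'bias' not in name and param.grad is not None:
            layers.append(name)
            ave_grads.append(param.grad.abs().mean().item())
    plt.plot(ave_grads, alpha=0.3, color='b')
    plt.hlines(0, 0, len(ave_grads) + 1, linewidth=1, color='k')
    plt.xticks(range(len(ave_grads)), layers, rotation='vertical')
    plt.xlabel('Layers')
    plt.ylabel('Average gradient')
    plt.grid(True)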

The train and validation code looks fine to me.

  1. Why are you using loss_value.backward(retain_graph=True) and not just loss_value.backward()? (See the sketch below for why it usually isn't needed.)
  2. Any particular reason for using L1 loss instead of another loss (e.g. CrossEntropyLoss)?
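
On the first point: retain_graph=True is normally only needed when the graph is reused across backward calls. Because your model stores h0/c0 as attributes and reassigns them in forward, each batch's graph stays connected to the previous one. Detaching the hidden state each step (a sketch based on your forward, not tested on your data) should normally let you drop retain_graph=True:

def forward(self, x):
    x = x.unsqueeze(1)
    x, (h, c) = self.lstm(x, (self.h0, self.c0))
    # store detached copies so the next batch doesn't backprop into this batch's graph
    self.h0, self.c0 = h.detach(), c.detach()
    x = self.f(x)
    x = x.flatten(1)
    x = self.f(self.fc1(x))
    x = self.f(self.fc2(x))
    return self.fc3(x)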