Word2Vec embeddings give strange gradients in the linear layers

Hello,

Sorry if I don't give all the info from the start; this is my first post on the forums.

I’ve been using Word2Vec from gensim as a feature extractor for text, and when training a PyTorch model on these embeddings the validation results fluctuate a lot from epoch to epoch. I think this is because the gradients are concentrated in the last layer and close to 0 in the other layers; I attached a plot of the per-layer gradients. What bothers me is that scikit-learn's MLPRegressor gives much more stable results on the same features, and when I replicate that MLP in PyTorch I get different results.
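
For context, the 200-dimensional features come out of something roughly like this (a simplified sketch: the names tokenized_texts, train_tokens, val_tokens and doc_vector are just illustrative, and the averaging into one vector per document is not my exact preprocessing):

import numpy as np
from gensim.models import Word2Vec

# tokenized_texts: list of token lists, e.g. [["the", "cat", "sat"], ...]
# vector_size is the gensim 4.x name; older gensim versions call it size
w2v = Word2Vec(sentences=tokenized_texts, vector_size=200, window=5, min_count=1, workers=4)

def doc_vector(tokens):
    # average the word vectors of the tokens that are in the vocabulary
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(200, dtype=np.float32)

wordvec_train = np.stack([doc_vector(t) for t in train_tokens])
wordvec_val = np.stack([doc_vector(t) for t in val_tokens])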

The model looks like this, trained with L1Loss and the Adam optimizer:

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # initial hidden/cell state, kept as attributes and updated in forward
        self.h0 = torch.randn(1, 1, 150).cuda()
        self.c0 = torch.randn(1, 1, 150).cuda()
        self.lstm = nn.LSTM(200, 150, 1)
        for name, param in self.lstm.named_parameters():
            if 'bias' in name:
                pass
            elif 'weight' in name:
                nn.init.xavier_normal_(param, gain=nn.init.calculate_gain('tanh'))
        self.fc1 = nn.Linear(150, 500)
        nn.init.xavier_normal_(self.fc1.weight, gain=nn.init.calculate_gain('tanh'))
        self.fc2 = nn.Linear(500, 300)
        nn.init.xavier_normal_(self.fc2.weight, gain=nn.init.calculate_gain('tanh'))
        self.fc3 = nn.Linear(300, 2)
        nn.init.xavier_normal_(self.fc3.weight, gain=nn.init.calculate_gain('tanh'))
        #self.bc1 = nn.BatchNorm1d(150)
        #self.bc3 = nn.LayerNorm(300)
        self.f = nn.Tanh()

    def forward(self, x):
        x = x.unsqueeze(1)
        # the updated hidden/cell state is stored back on the module
        x, (self.h0, self.c0) = self.lstm(x, (self.h0, self.c0))
        x = self.f(x)
        x = x.flatten(1)
        x = self.f(self.fc1(x))
        x = self.f(self.fc2(x))
        x = self.fc3(x)
        return x

[image: plot of the per-layer gradient flow]

I want to know why this happens and why I can't achieve better results than scikit-learn's MLPRegressor.

This could be due to the vanishing gradient problem. Can you try using just one fc layer instead of 3 (something like the sketch below)?
Also, please format your code so it's easier to read.
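
Roughly along these lines (just a sketch reusing your layer sizes; the class name NetOneFC is mine and I dropped the init code for brevity):

class NetOneFC(nn.Module):
    def __init__(self):
        super().__init__()
        self.h0 = torch.randn(1, 1, 150).cuda()
        self.c0 = torch.randn(1, 1, 150).cuda()
        self.lstm = nn.LSTM(200, 150, 1)
        self.fc = nn.Linear(150, 2)   # single fc head instead of 150->500->300->2
        self.f = nn.Tanh()

    def forward(self, x):
        x = x.unsqueeze(1)
        x, (self.h0, self.c0) = self.lstm(x, (self.h0, self.c0))
        x = self.f(x)
        x = x.flatten(1)
        return self.fc(x)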

Thank you for the response. I made the changes and kept only one fc layer; I attached the gradients, which look pretty much the same, and the validation results are worse. I also looked into vanishing gradients in LSTMs and didn't find anything that seemed to apply.
[image: gradient flow plot with a single fc layer]

I also tried increasing the batch size and doing a gradient update every 3 batches instead of every batch; the resulting accumulated gradients look like this:
[image: gradient flow plot with gradients accumulated over 3 batches]
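
The accumulation itself is roughly this change in the training loop (simplified sketch; dividing the loss by the number of accumulated batches is my own choice so the summed gradients match one big batch):

accumulation_steps = 3
optimizer.zero_grad()
for batch_idx, (data, target) in enumerate(train_loader):
    output = network(data.cuda())
    loss_value = loss(output, target.long().cuda()) / accumulation_steps
    loss_value.backward(retain_graph=True)
    if (batch_idx + 1) % accumulation_steps == 0:
        # step only every accumulation_steps batches, then clear the grads
        optimizer.step()
        optimizer.zero_grad()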

  1. Try using ReLU instead of Tanh for the activation (see the one-line change below the list).
  2. Can you share the rest of the code for training and validation?
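
For the first point, it's just a swap in your __init__ (the gain note is optional):

# in Net.__init__, replace the activation
self.f = nn.ReLU()   # instead of nn.Tanh()
# if you keep Xavier init, you may also want gain=nn.init.calculate_gain('relu')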

Sure, the loss is L1Loss and the optimizer is Adam; here are the training and validation functions:

from sklearn.metrics import mean_absolute_error

def test(loss):
    network.eval()
    with torch.no_grad():
        val = network(torch.Tensor(wordvec_val).cuda())
        loss_value = loss(val, torch.Tensor(test_labels).long().cuda())
        print(loss_value.item())
        val = val.cpu().detach().numpy()
        print(" ")
        print("Validation MAE : scikit-learn ", mean_absolute_error(val, test_labels))
        print(" ")
        test_val.append(mean_absolute_error(val, test_labels))

def train(epoch, loss):
    network.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = network(data.cuda())
        loss_value = loss(output, target.long().cuda())
        loss_value.backward(retain_graph=True)
        plot_grad_flow(network.named_parameters())
        optimizer.step()
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss_value.item()))
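
plot_grad_flow isn't shown above; it's roughly the usual helper from the forums that averages the absolute gradient per layer and plots it, something like:

import matplotlib.pyplot as plt

def plot_grad_flow(named_parameters):
    # average absolute gradient per weight tensor, skipping biases
    ave_grads, layers = [], []
    for name, param in named_parameters:
        if param.requires_grad and 'bias' not in name and param.grad is not None:
            layers.append(name)
            ave_grads.append(param.grad.abs().mean().item())
    plt.plot(ave_grads, alpha=0.3, color='b')
    plt.hlines(0, 0, len(ave_grads) + 1, linewidth=1, color='k')
    plt.xticks(range(len(ave_grads)), layers, rotation='vertical')
    plt.xlabel('Layers')
    plt.ylabel('Average gradient')
    plt.grid(True)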

The train and validation code looks fine to me.

  1. Why are you using loss_value.backward(retain_graph=True) and not just loss_value.backward()? (See the sketch below for why it usually isn't needed.)
  2. Any particular reason for using L1 loss instead of another loss (e.g. CrossEntropyLoss)?
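
On the first point: retain_graph=True is normally only needed when the graph is reused across backward calls. Because your model stores h0/c0 as attributes and reassigns them in forward, each batch's graph stays connected to the previous one. Detaching the hidden state each step (a sketch based on your forward, not tested on your data) should normally let you drop retain_graph=True:

def forward(self, x):
    x = x.unsqueeze(1)
    x, (h, c) = self.lstm(x, (self.h0, self.c0))
    # store detached copies so the next batch doesn't backprop into this batch's graph
    self.h0, self.c0 = h.detach(), c.detach()
    x = self.f(x)
    x = x.flatten(1)
    x = self.f(self.fc1(x))
    x = self.f(self.fc2(x))
    return self.fc3(x)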