Hi,

I have been trying to integrate BERT and a BiLSTM as part of a *one-class classification* experiment, and I noticed that the LSTM's gradient values become 0 after a few iterations. I am experimenting with my own loss function. Could you please help me find the hidden bug in this code? Or is this the usual behaviour for such a loss function?

This is my model’s structure:

```
class Experiment3(nn.Module):
    def __init__(self, bert):
        super(Experiment3, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.lstm = torch.nn.LSTM(768, 128, 2, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(128 * 2, 1)

    def forward(self, ids, mask):
        sequence_output, pooled_output = self.bert(ids, attention_mask=mask)
        lstm_output, (h, c) = self.lstm(sequence_output[:, 0, :])
        weights_ = self.fc.weight
        weights = torch.transpose(weights_, 0, 1)
        biases = nn.Parameter(torch.tensor([0.01]))
        fc_output = torch.matmul(lstm_output, weights) - biases
        return weights, biases, fc_output
```

The sample dataset has 100 examples with binary labels (0 or 1). I use a DataLoader with random sampling.

```
batch_size=1
train_data = TensorDataset(train_seq, train_mask, train_y)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
```
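For anyone who wants to reproduce this without my dataset, here is a minimal stand-in for `train_seq`, `train_mask`, and `train_y` (random token ids instead of real tokenized text; the sequence length of 32 is just an assumption for illustration):

```python
import torch

seq_len = 32          # assumed sequence length, not from my real data
vocab_size = 30522    # bert-base-uncased vocabulary size

# 100 examples: random token ids, all-ones attention masks, binary labels
train_seq = torch.randint(0, vocab_size, (100, seq_len))
train_mask = torch.ones(100, seq_len, dtype=torch.long)
train_y = torch.randint(0, 2, (100,))
```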

The training method is as follows:

```
def train():
    model.train()
    for i, batch in enumerate(train_dataloader):
        batch = [b.to(device) for b in batch]
        sent_id, mask, labels = batch
        model.zero_grad()
        weights, biases, output_ = model(sent_id, mask)
        loss = customLoss(weights, biases, output_)
        loss.backward()
        for name, param in model.named_parameters():
            print(name, param.grad)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        gc.collect()

def customLoss(weight, bias, output):
    a = 0.5 * torch.square(torch.linalg.norm(weight))
    temp = bias - output
    z = torch.zeros_like(temp)
    b = [torch.mean(torch.maximum(z, temp))]
    l2 = a + (1 / 0.01) * b[0] - bias
    return l2
```
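For context, the objective I am trying to implement with `customLoss` is the soft-margin one-class objective (in the style of OC-SVM / Deep SVDD), with `bias` playing the role of the margin $\rho$, `output` being the model score $f(x_i)$, and $\nu = 0.01$; this correspondence is my own reading of the code, so please correct me if the implementation does not actually match it:

```
L(w, \rho) \;=\; \frac{1}{2}\lVert w \rVert^2
\;+\; \frac{1}{\nu}\cdot\frac{1}{n}\sum_{i=1}^{n} \max\bigl(0,\; \rho - f(x_i)\bigr)
\;-\; \rho
```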

The driver code is as follows:

```
model = Experiment3(bert)
model = model.to(device)
optimizer = AdamW(model.parameters(), lr=2e-5)

epochs = 1
current = 1
while current <= epochs:
    print(f'\nEpoch {current} / {epochs}:')
    train()
    current = current + 1
```

I would appreciate any help understanding this network and its tricky gradient behaviour.