Hi everyone, I am implementing a bi-directional LSTM to predict race (Asian, Black, Hispanic, White) from a person's first name, last name, and the racial distribution of their zip code. I am getting NaN loss (presumably exploding gradients) despite trying all the common prescriptions: I have verified that my data contains no null values and that all values are between 0 and 1, and I have tried gradient clipping and ever smaller learning rates (down to lr=1e-19).

I suspect something is fundamentally wrong with the model itself, specifically with the way I feed in the racial distribution, because a bi-directional LSTM that uses only the first and last name runs without issues. I was hoping someone could provide some insight.
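For reference, this is roughly how I sanity-check the input batches (a minimal sketch; dataloader yields the same (name, pct, race) batches used in the training loop below):

import torch

for name, pct, race in dataloader:
    # no NaN/inf anywhere in the inputs
    assert torch.isfinite(name).all()
    assert torch.isfinite(pct).all()
    # all values between 0 and 1
    assert name.min() >= 0 and name.max() <= 1
    assert pct.min() >= 0 and pct.max() <= 1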
Here is my model:
import torch
import torch.nn as nn


class FirstLastZctaBiLSTM(nn.Module):
    def __init__(self, input_size: int, hidden_size: int, output_size: int):
        super(FirstLastZctaBiLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(
            input_size, hidden_size, batch_first=True, bidirectional=True
        )
        self.h2o = nn.Linear(hidden_size * 2 + 4, output_size)  # 4 is the number of races
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(
        self,
        name: torch.Tensor,
        pct: torch.Tensor,
        hidden: tuple[torch.Tensor, torch.Tensor],
    ):
        # grab the final hidden state
        _, (h_n, _) = self.lstm(name, hidden)
        # concatenate the last hidden state of both directions
        h_n = torch.cat((h_n[-2], h_n[-1]), dim=1)
        # add in the zip-code racial distribution at the last time step;
        # pct is a tensor of size (batch_size, 4)
        combined = torch.cat((h_n, pct), dim=1)
        output = self.h2o(combined)
        output: torch.Tensor = self.softmax(output)
        return output, hidden

    def init_hidden(self, batch_size: int):
        # one initial (h_0, c_0) state per direction
        return (
            torch.zeros(2, batch_size, self.hidden_size, device=DEVICE),
            torch.zeros(2, batch_size, self.hidden_size, device=DEVICE),
        )
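For context, this is how the inputs are shaped when I call the model (a minimal sketch with made-up sizes; in my real code input_size is the length of the character one-hot encoding, names are padded to a common length, and DEVICE points at my GPU):

DEVICE = torch.device("cpu")  # placeholder; cuda in my actual runs
batch_size, seq_len, input_size = 32, 20, 57  # made-up sizes for illustration
model = FirstLastZctaBiLSTM(input_size, hidden_size=128, output_size=4).to(DEVICE)

name = torch.zeros(batch_size, seq_len, input_size)  # padded one-hot encoded characters
pct = torch.rand(batch_size, 4)
pct = pct / pct.sum(dim=1, keepdim=True)  # rows sum to 1

hidden = model.init_hidden(batch_size)
output, _ = model(name, pct, hidden)  # (batch_size, 4) log-probabilities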
Here is my training loop:
import torch.optim as optim

loss_function = nn.NLLLoss()
opt = optim.SGD(
    model.parameters(),
    lr=.000001,
    momentum=.99,
    nesterov=True,
    weight_decay=.001,
)

for epoch in range(1, n_epochs + 1):
    for batch_no, (name, pct, race) in enumerate(dataloader, start=1):
        model.zero_grad(set_to_none=True)
        hidden = model.init_hidden(batch_size)
        output, _ = model(name, pct, hidden)
        loss: torch.Tensor = loss_function(output, race)
        loss.backward()
        if clip_value is not None:
            nn.utils.clip_grad_value_(model.parameters(), clip_value=clip_value)
        opt.step()
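To narrow down where things blow up, I have also tried instrumenting the loop like this (a sketch reusing the same variables as above; the grad-norm line only logs, it does not clip anything):

torch.autograd.set_detect_anomaly(True)  # report the op that first produces NaN in backward

for epoch in range(1, n_epochs + 1):
    for batch_no, (name, pct, race) in enumerate(dataloader, start=1):
        model.zero_grad(set_to_none=True)
        hidden = model.init_hidden(batch_size)
        output, _ = model(name, pct, hidden)
        loss: torch.Tensor = loss_function(output, race)
        if not torch.isfinite(loss):
            print(f"non-finite loss at epoch {epoch}, batch {batch_no}")
            break
        loss.backward()
        # log the total gradient norm before clipping
        grad_norm = torch.norm(
            torch.stack([p.grad.norm() for p in model.parameters() if p.grad is not None])
        )
        print(f"epoch {epoch}, batch {batch_no}: loss={loss.item():.4f}, grad_norm={grad_norm:.2f}")
        if clip_value is not None:
            nn.utils.clip_grad_value_(model.parameters(), clip_value=clip_value)
        opt.step()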
I have also tried winsorizing the distribution and renormalizing the percentages so that they sum to one, e.g. (.05, .1, .2, .2) → (0.0909, 0.1818, 0.3636, 0.3636), but the loss still goes to NaN.
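Concretely, the renormalization step looks like this (a sketch; pct is the (batch_size, 4) distribution tensor, and the clamp is only there so an all-zero row cannot cause a divide-by-zero):

# renormalize each row of the zip-code distribution so it sums to 1
row_sums = pct.sum(dim=1, keepdim=True).clamp(min=1e-8)
pct = pct / row_sums
# e.g. (.05, .1, .2, .2) -> (0.0909, 0.1818, 0.3636, 0.3636)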