I am trying to run the following code for a neural network chess evaluation function. The dataset appears to be correct, but when I try to train this model, the gradients are always zero. Not sure where I am going wrong.
Thanks for any help!
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class EvaulationFunction(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(64*12+6, 70)
        self.fc2 = nn.Linear(70, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

data = pd.read_csv("utils/data.csv", header=None)
torch_data = torch.tensor(data.values)
train_len = (int(len(torch_data)*.9)//16)*16
test_len = len(torch_data)-train_len
train, test = torch.utils.data.random_split(torch_data, [train_len, test_len])
trainloader = torch.utils.data.DataLoader(train, batch_size=16)
testloader = torch.utils.data.DataLoader(test, batch_size=16)
evf = EvaulationFunction()
criterion = nn.L1Loss()
optimizer = optim.SGD(evf.parameters(), lr=.1, momentum=0.9)
for epoch in range(10):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # each batch is one tensor: all columns but the last are inputs, the last is the label
        inputs = data[:,:-1].type(torch.float32)
        labels = data[:,-1].type(torch.float32)
        print(inputs)
        print(labels)
        # zero the parameter gradients
        optimizer.zero_grad()
        # forward + backward + optimize
        outputs = evf(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999: # print every 2000 mini-batches
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}')
            running_loss = 0.0
#print(list(evf.parameters()))
torch.save(evf.state_dict(), "model.pt")
It still isn’t converging, and I believe the model isn’t updating its parameters. (The average loss is printed every 2000 mini-batches, and it doesn’t change at all between epochs.)
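One way to check directly whether the parameters are being updated, rather than inferring it from the printed loss, is to look at the gradient norms after `backward()` and compare the parameters before and after `optimizer.step()`. The following is only a minimal sketch: the random `inputs` and `labels` are stand-ins for one batch of the real data, `grad_norms` is a hypothetical helper, and the output is squeezed to match the label shape, which the original loop does not do.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EvaulationFunction(nn.Module):
    # Same architecture as in the question.
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(64 * 12 + 6, 70)
        self.fc2 = nn.Linear(70, 1)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))


def grad_norms(model):
    """Hypothetical helper: map each parameter name to the L2 norm of its gradient."""
    return {
        name: p.grad.norm().item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }


torch.manual_seed(0)
evf = EvaulationFunction()
optimizer = torch.optim.SGD(evf.parameters(), lr=0.1, momentum=0.9)

# Assumption: random tensors stand in for one batch of 16 positions.
inputs = torch.randn(16, 64 * 12 + 6)
labels = torch.randn(16)

# Snapshot the parameters so they can be compared after the update.
before = [p.detach().clone() for p in evf.parameters()]

optimizer.zero_grad()
loss = nn.L1Loss()(evf(inputs).squeeze(1), labels)
loss.backward()
norms = grad_norms(evf)
optimizer.step()

# True for every parameter tensor that changed during the step.
changed = [not torch.equal(b, p.detach()) for b, p in zip(before, evf.parameters())]

print(norms)
print(changed)
```

If the norms are non-zero and the parameters change here but the loss on the real data stays flat, the problem is more likely in the data, the loss shapes, or the hyperparameters than in the optimizer itself.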
I would recommend first trying to overfit a small subset of your dataset (e.g. just 10 samples) by playing around with some hyperparameters (e.g. the learning rate). This is how I’ve tested your code and made sure the model itself is able to overfit random (noise) data.
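A rough sketch of what such an overfitting test could look like is below. It is not the exact code used for the test described above. The assumptions are: 10 random samples stand in for 10 rows of the real dataset, the learning rate is lowered to `1e-3`, and momentum is dropped. It also squeezes the `(N, 1)` model output to `(N,)` so that it matches the label shape. Without that, `nn.L1Loss` broadcasts the two tensors to `(N, N)` and emits a warning, which may be worth checking in the original training loop.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same two-layer network as in the question, written with nn.Sequential.
model = nn.Sequential(
    nn.Linear(64 * 12 + 6, 70),
    nn.ReLU(),
    nn.Linear(70, 1),
)

# Assumption: 10 random samples stand in for 10 rows of the real dataset.
inputs = torch.randn(10, 64 * 12 + 6)
labels = torch.randn(10)

criterion = nn.L1Loss()
# Assumption: a smaller learning rate than the question's lr=.1, no momentum.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

losses = []
for step in range(300):
    optimizer.zero_grad()
    # Squeeze (10, 1) -> (10,) so the output matches the label shape.
    outputs = model(inputs).squeeze(1)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

If the loss drops on a tiny subset like this but not on the full dataset, the next things to look at would be the scale of the labels and the learning rate.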
I’m not entirely clear on your objective here, but I do know that in chess, positional information is important. In fact, Google’s AlphaZero was designed with this in mind, using a series of convolutional layers before the linear layers.
When you flatten the chessboard, you end up discarding much of that positional information.
On the other hand, it looks like you also have some scalar information (i.e. +12+6) that you’d like the model to consider. If those values do not contain any positional information, you’d probably be best off feeding them into the linear layers that come after the CNN.
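To make the suggestion concrete, here is a minimal sketch of such a model. It rests on assumptions about the data layout that the question does not confirm: that the first `64*12` columns are 12 piece planes of 8x8 squares stored plane by plane, and that the last 6 columns are the non-positional scalars. The class name `ConvEvaluation` and the layer sizes are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvEvaluation(nn.Module):
    """Hypothetical CNN variant: conv layers on the board, scalars joined afterwards."""

    def __init__(self):
        super().__init__()
        # Assumption: 12 input planes (one per piece type and colour) of 8x8 squares.
        self.conv1 = nn.Conv2d(12, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 32, kernel_size=3, padding=1)
        # Flattened conv features plus the 6 scalar inputs.
        self.fc1 = nn.Linear(32 * 8 * 8 + 6, 70)
        self.fc2 = nn.Linear(70, 1)

    def forward(self, x):
        # Assumption: columns 0..767 are the board, columns 768..773 are scalars.
        board = x[:, :768].reshape(-1, 12, 8, 8)
        extras = x[:, 768:]
        h = F.relu(self.conv1(board))
        h = F.relu(self.conv2(h))
        # Concatenate the scalars only after the convolutional part.
        h = torch.cat([h.flatten(1), extras], dim=1)
        h = F.relu(self.fc1(h))
        # Return shape (N,) so it lines up with a 1-D label tensor.
        return self.fc2(h).squeeze(1)


model = ConvEvaluation()
batch = torch.randn(16, 64 * 12 + 6)  # random stand-in for one batch
out = model(batch)
print(out.shape)
```

The padding of 1 with a 3x3 kernel keeps the 8x8 board size through both conv layers, which is why `fc1` expects `32 * 8 * 8 + 6` inputs.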