Training a card game classifier

Hey there,
I currently use Monte Carlo Tree Search (MCTS) to predict good actions for a card game (4 players, 15 cards each). This works quite nicely, but is computationally expensive. That is why I thought about training a neural network on that data.
My goal is for this NN to predict an action very quickly for a given input state vector.

My data (generated by MCTS) for one batch is as follows:

  • x: input vector: 180x1
    • 60x1 binary vector for card is on the table
    • 60x1 binary vector for card is already played
    • 60x1 binary vector for card is in the player's hand
  • y: output vector: 60x1 (has exactly one 1, at the card index of the action played by the MCTS player).

Questions:

  1. As you can see, I have a binary input vector and a binary output vector. Is it advisable to use binary inputs and binary outputs?
  2. I am not sure how to design my NN (number of hidden layers, ReLU or linear, etc.). What would you suggest?
  3. I also do not know how to choose the loss function.

I have searched for classifier examples, but I mostly found classifiers for images, and also for the linear case (no binary inputs / outputs).
I hope you can point me to some examples which suit my case.

Hi Markus!

There is no problem in general with having binary inputs and/or
outputs. Also, to first approximation, you don’t get to choose your
inputs and outputs – the problem you are trying to solve gives
them to you.

Note, however, that your target vector – the value of your y vector
that you are given (rather than predict) and use to train with – is
not a general binary vector. It is a class label, equivalent to an
integer class label that takes its values in [0, 59] (inclusive). However,
it is one-hot encoded – that is, it is a binary vector with exactly
one 1 value (as you specified).

I would start by treating this as a conventional classification problem.
You have an input of 180 (binary) “features.” You are trying to predict
which of 60 classes that input corresponds to.

A good starting point would be a one- or two-hidden-layer network.
So, for a single-hidden-layer network, you might have a 180x120
fully-connected layer followed by a rectified-linear activation function
(“ReLU”), followed by a 120x60 fully-connected layer that produces
a vector of 60 raw scores (logits) as your output.

For two hidden layers you could add a second ReLU followed by,
say, a 60x60 fully-connected layer.
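
To be concrete, a rough (untested) sketch of these two variants:

import torch.nn as nn

# one hidden layer: 180 inputs -> 120 hidden units -> ReLU -> 60 logits
model_one = nn.Sequential(
    nn.Linear(180, 120),
    nn.ReLU(),
    nn.Linear(120, 60),
)

# two hidden layers: add a second ReLU and a 60x60 layer
model_two = nn.Sequential(
    nn.Linear(180, 120),
    nn.ReLU(),
    nn.Linear(120, 60),
    nn.ReLU(),
    nn.Linear(60, 60),
)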

You could also try other nonlinear activations in place of ReLU (such
as tanh).

Then feed your output vector and target into CrossEntropyLoss.
(For this, you will need to recast your one-hot target vector into a
single integer class label, because this is what CrossEntropyLoss
expects.)
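
In code, that might look something like this (input_batch and
target_one_hot are just illustrative names for a [nBatch, 180] input
batch and a [nBatch, 60] one-hot target batch):

loss_fn = torch.nn.CrossEntropyLoss()
logits = model_one(input_batch)            # raw scores, shape [nBatch, 60]
labels = target_one_hot.argmax(dim=1)      # one-hot -> integer class labels, shape [nBatch]
loss = loss_fn(logits, labels)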

Now for some more speculative comments:

I assume that in a given position in each of your three 60x1 binary
vectors at most one of the binary values can be 1. That is, you
have 60 different cards, and there is only one copy of any given
card. (And there are three other players whose hands are “hidden”
and therefore are not encoded in any vectors.)

This is structure that your network will have to “learn.” Hypothetically
you could encode this structure in your inputs (so the network won’t
have to learn it), for example, by having your input be a single length-60
vector with entries that take on four values: on-table, played, in-hand,
and in-hidden-hands. However, my intuition tells me that this would be
no better (and likely worse).

Also, I assume that there is a constraint that the class label (the one-hot
target vector in your original description) (and hence the prediction)
can only be one of the cards currently in the player’s hand – that is, a
card for which the player’s hand has a 1 in the corresponding position.

This is also structure that the network must learn. I don’t see an
attractive way of telling the network this structure, but you could
make it easier for the network to learn this structure. To do this,
you could add one more fully connected layer (to either your one- or
two-hidden-layer network) where you concatenate the player’s-hand
length-60 binary vector with the length-60 (former) output vector,
and pass it through a 120x60 fully-connected layer to form your
(new) length-60 output vector.

Giving your network a second “late look” at the player’s hand should
make it easy (I think) for the network to learn the constraint that you
can only play a card you have.
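
As an untested sketch, the one-hidden-layer version with this “late
look” might look like:

import torch
import torch.nn as nn

class LateLookNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(180, 120)
        self.fc2 = nn.Linear(120, 60)
        self.fc3 = nn.Linear(120, 60)   # "late look": [hand, scores] -> final scores

    def forward(self, x):
        # I'm assuming here that the last 60 entries of the input are the
        # player's-hand vector -- adjust the slice to your actual layout.
        hand = x[:, 120:180]
        scores = self.fc2(torch.relu(self.fc1(x)))           # former length-60 output
        return self.fc3(torch.cat([hand, scores], dim=1))    # new length-60 output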

Of course, this is only a bunch of talk about what might work. The
name of the game in this neural-network business is to try it out and
see what actually does work.

This sounds like an interesting (and potentially doable) problem.

Good luck!

K. Frank


Hey Frank,
thanks for your very helpful and detailed answer and all your explanations. It would be cool if we could discuss the code more concretely. I am not sure if I understood everything correctly. Here is the code based on your answer - it would be nice if you could have a short look at it:

Using a constraint is a cool idea (unfortunately I am not sure how to incorporate that).

# tested with python 3.7.5
from torch.utils.data import Dataset
import ast
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from torch.autograd import Variable

# Links:
# https://www.kaggle.com/danieldagnino/training-a-classifier-with-pytorch
# https://discuss.pytorch.org/t/training-a-card-game-classifier/70625/2

class MyDataSet(Dataset):
    '''A line in the dataset consists of
    [bin_options+bin_on_table+bin_played]+[action_output+[round(bestq, 2)]]
    Input: 180x1
    Output: one-hot encoded 60x1
    The bestq comes from the mcts estimation of how good the played action is
    This value is currently not used.
    '''
    def __init__(self, data_root):
        self.samples = []
        with open(data_root,"r") as f:
            self.samples = [ [ast.literal_eval(ast.literal_eval(elem)[0]), ast.literal_eval(ast.literal_eval(elem)[1])] for elem in f.read().split('\n') if elem]
        print(len(self.samples)) # ca. 18000

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # Function that returns one input and one output (label)
        # as one dim output use:
        # torch.Tensor(self.samples[idx][0]), torch.Tensor([self.samples[idx][1]].index(1))
        return torch.Tensor(self.samples[idx][0]), torch.Tensor([self.samples[idx][1]])

class my_model(nn.Module):
    '''One-hidden-layer classifier: 180 binary inputs -> 120 hidden units (ReLU) -> 60 logits.
    '''
    def __init__(self, n_in=180, n_hidden=1, n_out=60):
        super(my_model, self).__init__()
        self.n_in  = n_in
        self.n_out = n_out

        self.fc1 = nn.Linear(self.n_in, 120)
        # TODO
        # insert a fully connected layer here, directly connecting
        # the hand with the output
        self.relu1 = nn.ReLU(inplace=True)
        self.fc2 = nn.Linear(120, self.n_out)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu1(x)
        x = self.fc2(x)
        return x

if __name__ == '__main__':
    from torch.utils.data import DataLoader
    my_data = MyDataSet('actions__.txt')

    my_loader = DataLoader(my_data, batch_size=1, num_workers=0)

    model = my_model()

    criterium = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

    # Training.
    loss_values = []
    running_loss = 0.0
    epoch        = 0

    for k, (data, target) in enumerate(my_loader):

        data   = Variable(data,          requires_grad=False) # input
        target = Variable(target.long(), requires_grad=False) # output

        # squeeze the target here?!
        # as your target has a channel dimension of 1, which is not needed
        # when using nn.CrossEntropyLoss or nn.NLLLoss.
        #target = target.squeeze(1)
        # TODO: recast the target from one-hot to a single class index!

        # Set gradient to 0.
        optimizer.zero_grad()

        # Feed forward.
        pred = model(data)
        print(pred.shape, target.shape) # torch.Size([1, 60]) torch.Size([1, 1, 61])

        #ValueError: Expected input batch_size (1) to match target batch_size (61).
        # Fails:
        loss = criterium(pred, target.view(-1))

        # Gradient calculation.
        loss.backward()

        # print statistics
        running_loss += loss.item()
        if k % 100 == 99:    # print every 100 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, k + 1, running_loss / 100 ))
            loss_values.append(running_loss / 100 )
            running_loss = 0.0

        # Model weight modification based on the optimizer.
        optimizer.step()

Further Notes:
I currently have 18786 samples -> do you think I need more?

One Batch (line) in actions__.txt looks like:
[’[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]’, ‘[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, -9.96]’]

Hello Markus!

I haven’t looked at your code in detail, but here are comments on some
things I noticed:

First note that this version of my_model only supports one hidden
layer. So I would just get rid of the n_hidden argument to the
constructor – it’s not used at all, so it just confuses things.

It’s not a big deal, but you might consider not hard-coding the value
of 120 for the number of hidden neurons. It’s a value you might
want to play with, so you could make it an argument to your
my_model constructor, e.g.,

    def __init__(self, n_in=180, n_middle=120, n_out=60):

Adam is a perfectly reasonable optimizer, and often works better
than SGD. But you might also want to experiment with SGD as it
has fewer moving parts, so it can make things a little easier to
understand when you’re developing a new model.

Several things here:

First (and not so important), your batch size – the number of
individual input samples you process per optimizer step – is 1.

As a general rule (and, as a general rule, general rules for neural
networks are often wrong), you want to have “many” samples
in a batch. I don’t really have any intuition for your specific problem,
but I think playing with batch sizes ranging from 50 to 500 might
make sense. (Powers of 2 look cool.)
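
In your code that is just the batch_size argument of your DataLoader,
for example (shuffling the training samples is also generally a good idea):

my_loader = DataLoader(my_data, batch_size=128, shuffle=True, num_workers=0)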

Now for the error: Your target shape is wrong. data.shape is
presumably [nBatch, 180] (where you have nBatch = 1).
pred.shape = [nBatch, 60] is correct.

For CrossEntropyLoss you need to have target.shape = [nBatch],
that is, a single integer class label for each sample in your batch.
target.shape = [1, 1, 61] is wrong, and taking .view(-1)
doesn’t fix that because target.view(-1).shape = [61], which
is still wrong.

As I understand it, when you read in target, you read in 61 numbers:
a length-60 one-hot encoded class label, followed by what you call
bestq. First, you need to get rid of bestq – your model has no way
to know to ignore it, so if you include it you will get an error. Second,
you need to convert your one-hot encoded (59 '0’s and one ‘1’)
class-label vector into a single integer (in the range [0, 59], inclusive)
class label. You can do this as follows:

dummy, integer_class_batch = one_hot_class_batch.max(dim=1)

This version of max() returns the tuple (max, argmax).
dummy = max = 1, and we discard it. argmax is how we get the
single integer class label from the one-hot vector.

Here, one_hot_class_batch.shape = [nBatch, 60], and
integer_class_batch.shape = [nBatch].

I don’t have intuition about your specific problem. In general,
20,000 samples for a (straightforward) 60-class classification
problem should be plenty. But the structure of your card game
could make the problem hard. How easy or hard is it for a
person to choose the right move just looking at the hands?

A couple more comments:

You might consider first building a similar model for a well-studied
problem (for example MNIST digit classification). That way you’ll
be able to separate bugs in your code (or suboptimal models) from
the problem itself being hard to solve.

As a modified approach, you might consider training your model not
with a target that is the single best move, but with scores for how
good each possible move is (or, for practical purposes, perhaps how
good each of the ten best moves is).

Suppose you have five possible moves: (bad, bad, good, good, best).
Categorical cross-entropy (that is, target is the single correct class
label) will score the two following predictions equally:

   (33% (bad), 33% (bad),  0% (good),  0% (good), 33% (best))
   ( 0% (bad),  0% (bad), 33% (good), 33% (good), 33% (best))

But a model that makes the second prediction will play a much better
game (even if not perfect) than a model that makes the first.

You talk about bestq. If you have q values for all possible moves
(or for the best several moves), you could use (values derived from)
your q values as your target, and use what I’ll call (for lack of a
better term) “full” cross-entropy for your loss function.
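
One way to write such a loss by hand is sketched below (soft_targets is
a hypothetical [nBatch, 60] tensor of target probabilities that you would
derive from your q values):

import torch.nn.functional as F

def full_cross_entropy(logits, soft_targets):
    # logits: raw scores from the network, shape [nBatch, 60]
    # soft_targets: target probabilities (non-negative, summing to 1 along dim 1)
    return -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()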

Last, you mention that you are not sure how to incorporate the constraint.

I think you might be referring to the constraint I speculated about that
you can only play a card you actually have in your current hand.

My point was not that you would somehow impose such a constraint.
(That is, you wouldn’t ever make such a constraint explicit or build it
into your model.) But if such a constraint is part of the rules of your
game, your model will have to learn that rule. My suggestion to add
to the model a “late look” at what’s in the current hand is a proposal
for making it easier for the model to learn that rule.

Good luck!

K. Frank

Hey Frank,
I changed my code in the Dataset to just give back the class index as the target (with that modification the code ran through without any errors).

def __getitem__(self, idx):
    return torch.Tensor(self.samples[idx][0]), torch.Tensor([self.samples[idx][1][:-1].index(1)])

My problem now is that I am not sure how to interpret the output when I give a 180x1 input vector to the trained model. For example, with learning_rate = 1e-5 I get back the following:

tensor([ 0.0982,  0.0182,  0.0380,  0.0131, -0.1438, -0.1540, -0.1128, -0.1452,
         0.0520, -0.0235, -0.0581, -0.1654,  0.0238, -0.0367, -0.0718,  0.0265,
        -0.0240,  0.0926, -0.0354, -0.1772,  0.0667,  0.0795, -0.1556, -0.1301,
        -0.1172,  0.1152,  0.1159, -0.0975,  0.0838, -0.0328, -0.0970, -0.0034,
        -0.0415,  0.0879,  0.0251,  0.0339, -0.0162, -0.0447, -0.1247, -0.0288,
        -0.0193,  0.0407,  0.0237, -0.1138, -0.0198,  0.0213,  0.0173,  0.0107,
         0.0198,  0.0166, -0.0446, -0.1113, -0.0847, -0.1868, -0.0057, -0.0450,
        -0.1188, -0.1111, -0.0694,  0.0380], grad_fn=<AddBackward0>)

or with learning_rate = 0.1:

tensor([20.7308, -0.3333, -0.3342, -0.3399, -0.3359, -0.3355, -0.3352, -0.3441,
        -0.3427, -0.3442, -0.3405, -0.3442, -0.3394, -0.3452, -0.3483, -0.3503,
        -0.3363, -0.3425, -0.3441, -0.3375, -0.3388, -0.3365, -0.3406, -0.3325,
        -0.3509, -0.3480, -0.3365, -0.3329, -0.3407, -0.3453, -0.3416, -0.3386,
        -0.3493, -0.3467, -0.3386, -0.3418, -0.3363, -0.3493, -0.3445, -0.3397,
        -0.3425, -0.3497, -0.3367, -0.3371, -0.3520, -0.3403, -0.3394, -0.3397,
        -0.3496, -0.3333, -0.3350, -0.3397, -0.3334, -0.3502, -0.3414, -0.3430,
        -0.3341, -0.3643, -0.3401, -0.3340], grad_fn=<AddBackward0>)

What is the action index I should now play? Taking e.g. index 21 did not work (it plays moves that are not possible).

Well, currently I have in my dataset the best-move index and the q value, which is the expected output value (reward) if the player plays this card.

Hm, ok, I could also train my classifier on q values. Then each of the 60 outputs would be interpreted as the q value (the reward when the game is finished) of that action - is that correct? I would then know for which action I get the best q value, and I simply have to find the maximum q among the 60 outputs, restricted to the possible actions. Is that correct?

Referring to the constraint: yes, it is a constraint that a player can only play a card it has in its hand.
I think a NN like this one would be nice (the output value is the q value):

I am not sure how to construct this model and how to train it. I will work on this one as well.

Hi Markus!

As to how to interpret the output, the output of your network means
(as always) what you train it to mean.

You are training a conventional 60-class classifier using cross-entropy
(with categorical class labels) as the loss function. What does this
mean?

First, you are using CrossEntropyLoss. This takes raw-score logits
as its predictions (the output of your network). Internally to
CrossEntropyLoss these are converted to probabilities by (in effect)
passing them through softmax().

Cross-entropy then measures how dissimilar two probability distributions
are.

What are your two probability distributions? They are the probability
of “0” being the right move, the probability of “1” being the right
move, and so on. That is, they are discrete probability distributions
on the space [0, 59] (inclusive). As probabilities, they are between
0 and 1, and they sum to 1: sum_{i = 0}^{59} P(i) = 1.

Your first probability distribution is your prediction, that is, the output of
your network (more precisely, the output of your network after passing
it through softmax()). Your second distribution is your target. In this
case your target probability distribution has a special form – exactly
one of the 60 probabilities is 1, with the other 59 being 0.

So you are training your network to output the probability of each
of the 60 moves being the right move. The simplest rule will be to
play the move that has the largest predicted probability of being right.
(Again, your network outputs logits, but the larger (more positive) the
logit, the larger the probability, and the smaller (more negative) the
logit, the smaller the probability, so choosing the move with the largest
logit is the same as choosing the move with the largest probability.)

Again, the output of your network means what you train it to mean,
and you’ve trained yours (by using CrossEntropyLoss) to output
the predicted probability of each of the 60 moves being right.

Now, as a practical matter, in order to play the game, you know that
only some moves are actually possible. Your network – not being
perfect – may sometimes predict an impossible move. It’s perfectly
reasonable to play the possible move with the largest predicted
logit (probability), rather than playing an impossible move with a
larger logit. Here, you would be taking the output of your (imperfect)
network and combining it with some simple knowledge you have that
your network hasn’t (fully) learned. In practice, that’s okay.
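
Concretely, at play time you could do something like this (hand being
the length-60 binary in-hand vector for the current position, as a tensor;
just a sketch):

logits = model(state_vector)                            # raw scores, shape [60]
legal = logits.masked_fill(hand == 0, float('-inf'))    # rule out cards not in the hand
action = legal.argmax().item()                          # index of the best legal card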

Best.

K. Frank

Hey Frank,

training worked; however, the result is not so good - the player trained on the MCTS data is only a little bit better than a random player.

I think the next step is to move on to real learning. However, for this multiplayer game, learning is not so easy, as the action space can vary depending on the cards. Moreover, the cards of the other players are not known (imperfect information). For this reason I would try a policy gradient method in such a case.

What would you suggest? The only similar example I found is here, which is not documented at all :frowning:

You can check out the game here:

Hello Markus!

I would ask how well the training worked, and how you measured that.

Common practice is to break your dataset into two pieces: a training
dataset that you use to train your network, and a test / validation
dataset that you do not train on and that you use to test your trained
network.

You want to track your loss function, usually for every batch, since
you have it per-batch anyway, and your “accuracy” function, again,
usually for every batch, since you already have the per-batch
prediction, and the extra cost of calculating the accuracy is not
that much.

For a conventional classification problem, the accuracy function is
the fraction of predictions that are correct – that is, how often you
predicted the best card to play for the sample hand. If you always
make the correct prediction, your accuracy would be 1 (100%, if
you prefer that). If your predictions were effectively random, your
accuracy would be 0.01667 – you would guess the right move one
time in sixty.
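
As a rough sketch of the split and the per-batch accuracy (the 80 / 20
split is just an example; target here is the batch of integer class labels,
shape [nBatch]):

from torch.utils.data import random_split

n_test = len(my_data) // 5
train_data, test_data = random_split(my_data, [len(my_data) - n_test, n_test])

# per-batch accuracy, computed from the predictions you already have:
accuracy = (pred.argmax(dim=1) == target).float().mean().item()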

If your network is training well, then you should see your loss fall
to a number that is close (in some units) to zero, and your accuracy
rise up close to one. If this isn’t happening, your network isn’t really
training well. (It might be training as well as it can, given the difficulty
of the problem, but not really training well.)

You also want to track the loss and accuracy of your network on your
test dataset – that is, on samples that were not used in training the
network. If it’s not too expensive, you can evaluate your test loss and
accuracy every batch. In any event, you should probably run your
test evaluation at least once every epoch. Your network can “train”
well – low loss, high accuracy – on your training set, but still not
perform well on your test set. This is often referred to as “overfitting”
or “not generalizing.”

If your network is overfitting, there are some things you can try. Try
fewer parameters and try regularization / weight decay. There is
some thinking that ReLU activations are likely to generalize better
than things like tanh(). You can also reduce overfitting by training
on more data, if, of course, you have it.

If your network is not training well, try running the training longer.
The training process sometimes hits “plateaus.” Try changing
the learning rate – both larger and smaller. Try using a different
optimizer, e.g., switch SGD for Adam (or vice versa). It can also be
worth changing your network architecture, in particular, adding more
and/or wider layers (both of which translate to more parameters).
These “bigger” networks are likely to train more slowly, but may end
up performing better after the slower training.

Lastly, if your network is training and generalizing well – that is, if
it is performing well on your test dataset, in particular, achieving a
high accuracy – but not playing a good game, then there is something
unrealistic with the data you are training on. After all, if your network
is doing a good job predicting the moves chosen by your “MCTS”
scheme, but not playing well, then presumably your MCTS system
shouldn’t be playing well either.

Charts as your training progresses of your loss and accuracy on
both your training and test datasets would help give a sense of
which scenario is likely to be playing out.

(As for other learning methods that might be useful for your card game,
I don’t have much to say. I don’t know much about potentially relevant
learning methods, and I don’t have much intuition about your specific
use case.)

Good luck.

K. Frank