Help improving sports prediction model

Hi there,

:rotating_light: I’m very much beginning my journey into PyTorch, and thought I’d reach out for advice and suggested improvements.

I’m playing with a model that predicts football (soccer) matches. Raw data is in this CSV format:


Full dataset here

For the ‘result’ column:

H: Home team won
A: Away team won
D: The match was a draw

I can feed my model a home and away team (which are converted into a list of unique ints), and have it predict the result of a match in ints (H: 2 / A: 1 / D: 0).

But after training for a while it’s not that effective: I can see the loss going down to about 0.49, but I can’t seem to reduce it any further.

Is this just the nature of sports data, or am I introducing any bad practices in my code? Any tips and guidance on this kind of project would be greatly appreciated. :pray:

import matplotlib.pyplot as plt
import numpy as np
import pandas
import torch
from torch import nn
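from sklearn.model_selection import train_test_split

# Seed used for the train/test split and for PyTorch below.
# The value was not shown in the post; 42 is a placeholder.
RANDOM_SEED = 42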

def get_data():
    csv = pandas.read_csv('./data.csv')
    data = csv.drop(
        columns=['season', 'home_goals', 'away_goals'])
    return data

def get_teams():
    # Combine home and away team names, get unique cases + optionally sort
    teams_unique = pandas.concat(
        [data['home_team'], data['away_team']]).unique()
    teams_sorted = np.sort(teams_unique)
    teams = dict(zip(teams_sorted, range(len(teams_sorted))))
    return teams

# Build dictionary
data = get_data()
teams = get_teams()

def get_team(team_str="Arsenal"):
    # Look up a team's integer index, for use now and later when predicting
    return teams[team_str]

# Features / teams as ints
data_features = []
for r in data.itertuples():
    data_features.append([get_team(r.home_team), get_team(r.away_team)])

for r in data_features[:10]:
    print(list(teams.keys())[r[0]], "vs", list(teams.keys())[r[1]])

# Scores
data_scores = []
for r in data[["result"]].itertuples():
    result = r.result
    if result == "H":
        res = 2
    elif result == "A":
        res = 1
    else:
        # "D": the match was a draw
        res = 0
    data_scores.append(res)

# Split the data into training and testing sets
X = torch.tensor(data_features, dtype=torch.float32)
y = torch.tensor(data_scores, dtype=torch.int64)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_SEED)

print(
    f"X_train: {X_train.shape}: {X_train.dtype} | y_train: {y_train.shape}: {y_train.dtype}")
print(
    f"X_test: {X_test.shape}: {X_test.dtype} | y_test: {y_test.shape}: {y_test.dtype}")

# Build the model

class ModelV1(nn.Module):
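    def __init__(self, INPUT_FEATURES=2, OUTPUT_FEATURES=2, HIDDEN_UNITS=8):
        super().__init__()
        # Only the middle Linear layer survived in the post; the input/output
        # layers and the Sigmoid activations are a reconstruction
        self.layers = nn.Sequential(
            nn.Linear(in_features=INPUT_FEATURES, out_features=HIDDEN_UNITS),
            nn.Sigmoid(),
            nn.Linear(in_features=HIDDEN_UNITS, out_features=HIDDEN_UNITS),
            nn.Sigmoid(),
            nn.Linear(in_features=HIDDEN_UNITS, out_features=OUTPUT_FEATURES),
        )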

    def forward(self, x):
        return self.layers(x)

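INPUT_FEATURES = X_train.shape[1]
OUTPUT_FEATURES = 3  # three classes: H / A / D
HIDDEN_UNITS = len(teams) * 4

model = ModelV1(INPUT_FEATURES=INPUT_FEATURES,
                OUTPUT_FEATURES=OUTPUT_FEATURES,
                HIDDEN_UNITS=HIDDEN_UNITS)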

# Loss
loss_fn = nn.CrossEntropyLoss()

# Optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Accuracy

def accuracy_fn(outputs, targets):
    correct = torch.sum(outputs == targets).item()
    acc = (correct/len(outputs)) * 100
    return acc

# Prepare
torch.manual_seed(RANDOM_SEED)

# Set no of epochs
EPOCHS = 1000
print_steps = round(EPOCHS / 100)
losses = []
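for epoch in range(EPOCHS):
    model.train()
    # Forward pass
    y_logits = model(X_train)
    # Softmax over the class dimension (dim=1), then pick the most likely class
    outputs = torch.softmax(y_logits, dim=1).argmax(dim=1)
    # Calculate training loss / acc
    loss = loss_fn(y_logits, y_train)
    acc = accuracy_fn(outputs, y_train)
    # Backpropagate and update the weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % print_steps == 0:
        print(f"Epoch: {epoch+1}/{EPOCHS} | Loss: {loss:.5f}")
    model.eval()
    with torch.inference_mode():
        # Forward pass
        test_logits = model(X_test)
        outputs = torch.softmax(test_logits, dim=1).argmax(dim=1)
        # Calculate test loss / acc
        test_loss = loss_fn(test_logits, y_test)
        test_acc = accuracy_fn(outputs, y_test)
    # Record the test loss for the plot at the end
    losses.append(test_loss.item())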

# Compare results

print_steps = round(len(outputs) / 10)
correct = 0

for i, o in enumerate(outputs):
    is_correct = y_test[i].item() == o.item()
    icon = "✅" if is_correct else "❌"
    correct += 1 if is_correct else 0
    if i % print_steps == 0:
        print(
            f"{icon} Actual: {y_test[i].item():.2f} | Predicted: {o.item():.2f}")
print("-" * 30)
print(f"Correct: {correct} / {len(outputs)}")
print(f"Accuracy: {correct/len(outputs)*100:.2f}%")

# Plot training and test losses
plt.plot(range(EPOCHS), losses, label="Test Loss")
plt.legend(prop={'size': 12})

The posted raw data does not seem to contain any features besides the outcome from past games. Could you explain what you expect the model to learn from this data?

:wave: Hey @ptrblck, I’m hoping (naively perhaps!) that by giving the model history of home and away teams, along with past results (Home team won / Away team won / Draw), it can learn which team combinations tend to result in which match outcomes. A few examples of potential learning:

  • Which teams tend to win against other teams generally
  • When teams tend to beat teams at home (indicates strong home advantage)
  • When one team happens to be really good at beating another specific team either home or away (perhaps due to their playing tactics)
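
As a rough sanity check on these expectations: with only the two team IDs as inputs, what a classifier can pick up is essentially the historical outcome distribution for each (home, away) pairing. A minimal sketch of that baseline, using made-up rows; `pairing_frequencies` is a hypothetical helper, not something from the code above.

```python
from collections import Counter, defaultdict

# Made-up rows for illustration: (home_team, away_team, result)
matches = [
    ("Arsenal", "Chelsea", "H"),
    ("Arsenal", "Chelsea", "D"),
    ("Arsenal", "Chelsea", "H"),
    ("Chelsea", "Arsenal", "A"),
    ("Leeds", "Arsenal", "A"),
]


def pairing_frequencies(rows):
    """Return {(home, away): {result: share}} from historical rows."""
    counts = defaultdict(Counter)
    for home, away, result in rows:
        counts[(home, away)][result] += 1
    return {
        pair: {res: n / sum(c.values()) for res, n in c.items()}
        for pair, c in counts.items()
    }


freqs = pairing_frequencies(matches)
print(freqs[("Arsenal", "Chelsea")])
```

Comparing a trained model's accuracy against always picking the most frequent result per pairing is one way to see how much the network adds.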

Would adding more features (home / away goals, other match statistics potentially) to the input tensors be a way of reducing loss? Ideally after training, I would only want to provide two teams (home and away) as inputs to the model for it to predict the result.

@ptrblck sorry for the nudge, but any thoughts here on my last reply?

  1. Tanh may make for a better activation layer than Sigmoid for intermediate layers.
  2. Conv1d or a TransformerEncoder may provide better results, as games further away in time may have less impact on the outcome. Structure the data so that input dims are something like [ batch_size, num_game_season, (win/tie/loss, score ratio)]
  3. You could encode the results of past games with Win = 1.0, Tie = 0.5, Loss = 0.0 for inputs and probability distribution for outputs.
  4. Dropout on the intermediate layers may help. TransformerEncoder can be set with the dropout argument.
  5. A score ratio of loser/winner scores could be added as a second channel, using 0.5 for a tie (that will prevent a divide by zero in the case of 0 / 0).
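
Points 3 and 5 could be sketched roughly as follows. The function names and the sample scores are made up for illustration; results are encoded from one team's point of view.

```python
def encode_result(goals_for: int, goals_against: int) -> float:
    """Win = 1.0, tie = 0.5, loss = 0.0, from one team's point of view."""
    if goals_for > goals_against:
        return 1.0
    if goals_for == goals_against:
        return 0.5
    return 0.0


def score_ratio(goals_for: int, goals_against: int) -> float:
    """Loser/winner score ratio; ties map to 0.5, which avoids 0 / 0."""
    if goals_for == goals_against:
        return 0.5
    # In a non-tied game the winner scored at least once, so max(...) >= 1
    return min(goals_for, goals_against) / max(goals_for, goals_against)


# Made-up (goals_for, goals_against) history for one team, oldest first
history = [(2, 1), (0, 0), (1, 3)]
features = [[encode_result(f, a), score_ratio(f, a)] for f, a in history]
print(features)
```

Each inner list is one game with two channels (result, score ratio), which matches the suggested input layout once stacked per team and batched.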

Thanks very much! Will give these a go. :pray:

More as a side note: Did you also try more traditional models (e.g., Decision Trees, Random Forests, Gradient Boosted Trees)? I wouldn’t be surprised if those work better for your type of structured data. At least, that is what I observe in my course projects (classification or regression tasks over structured data): neural network-based models never come out on top when teams compare different methods.
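
For reference, a minimal sketch of such a baseline with scikit-learn, assuming one-hot encoded home and away teams as the only features. The fixtures and results here are random placeholders, so the printed accuracy is meaningless; swap in the real columns to compare against the network.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
teams = np.array(["Arsenal", "Chelsea", "Leeds", "Everton"])

# Random placeholder fixtures (home, away) and results, for illustration only
X_raw = rng.choice(teams, size=(200, 2))
y = rng.choice(np.array(["H", "A", "D"]), size=200)

# One column per (position, team) combination instead of a single integer ID
encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(X_raw)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2f}")
```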

Thanks @vdw , I haven’t tried traditional models like the ones you mention, will look into trying and comparing their results against the above soon. :slight_smile:

@J_Johnson me again, I’m slowly working my way through your and @vdw's options here to understand what will work best.

I’m still learning about Transformers. For the TransformerEncoder approach you mentioned, would you be able to share a rough example of what a TransformerEncoder model itself might look like in this instance? I’ve started a thread here as I’m having trouble adapting the PyTorch example to a classification task like this. Happy to reward you for your time. :pray:

In the case of a language model using a TransformerEncoder, you have the following:

First, there is an embedding layer. This takes every word and vectorizes it, giving it some sort of semantic meaning. Usually, there is an input of size (batch size, sequence length) and an output of size (batch size, sequence length, embedding dim), assuming batch_first = True. The embedding dim contains qualitative information about the meaning of each word. Perhaps a certain dimension corresponds with hot/cold, where coffee might score 0.2 and Iced Tea might score 0.8. So the data and training establish these values and meanings.

Likewise, each team may have some qualitative value. Perhaps this team scores higher on defense and strategy, but lower on speed and accuracy. Anyway, the model can learn its own internal qualitative representation of each team during training.
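
As a small illustration of the shapes involved, here is a sketch of a team embedding. The `teams` lookup and the embedding size of 4 are arbitrary choices for the example.

```python
import torch
from torch import nn

teams = {"Arsenal": 0, "Chelsea": 1, "Leeds": 2}  # hypothetical lookup table
EMBEDDING_DIM = 4  # arbitrary size, for illustration

embedding = nn.Embedding(num_embeddings=len(teams), embedding_dim=EMBEDDING_DIM)

# One batch holding a single (home, away) pair of team indices.
# nn.Embedding expects integer indices, not floats.
pair = torch.tensor([[teams["Arsenal"], teams["Chelsea"]]], dtype=torch.long)
vectors = embedding(pair)
print(vectors.shape)  # (batch size, sequence length, embedding dim)
```

The embedding weights are ordinary parameters, so they are updated by the optimizer along with the rest of the model.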

Those embeddings then enter the TransformerEncoder, with a shape something like (batch size, sequence length, embedding dim). You want some history of who is playing who. For example, your input sequence might be in pairs (home_0, away_0, home_1, away_1, …) where each of those is a team and the pairs are ordered from oldest to most recent matches. In language models, the transformer layer is used to attend to what the next token prediction may be. In this case, though, you would not be predicting the next token, but the outcome of the last two in the sequence, e.g. Cowboys vs. the Eagles.
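
One possible way to assemble such a sequence, as a sketch: `build_sequence` and its `max_matches` argument are hypothetical names, and the fixtures are made up.

```python
def build_sequence(history, upcoming, teams, max_matches=3):
    """Flatten recent fixtures plus the upcoming one into team indices.

    history:  list of (home, away) team names, oldest first
    upcoming: (home, away) team names for the match to predict
    """
    recent = history[-max_matches:]
    sequence = []
    for home, away in recent + [upcoming]:
        sequence.extend([teams[home], teams[away]])
    return sequence


teams = {"Arsenal": 0, "Chelsea": 1, "Leeds": 2}
history = [
    ("Arsenal", "Chelsea"),
    ("Leeds", "Arsenal"),
    ("Chelsea", "Leeds"),
    ("Arsenal", "Leeds"),
]
seq = build_sequence(history, ("Chelsea", "Arsenal"), teams, max_matches=2)
print(seq)
```

The last two entries are always the match to predict. Sequences with fewer than `max_matches` past fixtures come out shorter, so they would need padding before being batched.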

In a language model, you then have a linear layer which predicts the next token by giving out a probability score for every possible word, something of size (batch size, all possible words). And then you argmax on that or use some random selection of the top 10% or … Anyway, next token selection in LLMs is irrelevant in your use case, so I’ll spare you that explanation. You would then have a final Linear layer which, instead of being next token prediction, is predicting the outcome of the game, something of size (batch size, 3), for win/lose/tie.
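
Put together, a rough sketch of that structure might look like this. The class name and all sizes are placeholders, positional encoding is left out for brevity, and pooling the last two positions is just one way to read off "the last two in the sequence".

```python
import torch
from torch import nn


class MatchOutcomeClassifier(nn.Module):
    """Hypothetical sketch: embedding -> TransformerEncoder -> Linear(3)."""

    def __init__(self, num_teams, embedding_dim=16, num_heads=2, num_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(num_teams, embedding_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embedding_dim, nhead=num_heads,
            dim_feedforward=32, dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(embedding_dim, 3)  # win / lose / tie logits

    def forward(self, team_ids):
        x = self.embedding(team_ids)  # (batch, seq, embedding_dim)
        x = self.encoder(x)           # (batch, seq, embedding_dim)
        # Average the last two positions: home and away team of the match to predict
        return self.head(x[:, -2:, :].mean(dim=1))  # (batch, 3)


model = MatchOutcomeClassifier(num_teams=20)
batch = torch.randint(0, 20, (4, 6))  # 4 sequences of 3 fixtures each
logits = model(batch)
print(logits.shape)
```

The logits can go straight into `nn.CrossEntropyLoss`, which applies the softmax internally.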

Thank you so much for this thorough reply! I’ll give it a whirl soon and report back.

Hi Matt,
I found this thread through a Google search. I am also looking at sports prediction using a moderately large neural net. How is progress with your project? My results are looking ok but not outstanding. It is not obvious to me how to specify the dimensionality for embedding vectors, attention head sizes etc. There are too many combinations to search through them all. Have your experiments taught you anything about these things?
Best wishes, Ben.

Hey @Ben_P, I must admit things have stalled a little. I too ended up with ‘ok’ results using a Transformer model. I actually had more luck with a Conv1d network, reaching about 63% accuracy when predicting a Home / Draw / Away winner, but this could certainly be down to my naivety with PyTorch.

So I’ve parked the project for a while amidst other things and aiming to come back soon. Sharing something that you might find useful here:

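import math

import torch
from torch import Tensor, nn
from torch.nn import TransformerEncoder, TransformerEncoderLayer

class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()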
        self.dropout = nn.Dropout(p=dropout)
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2)
                             * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x: Tensor) -> Tensor:
        # x: [seq_len, batch_size, d_model]; add the encoding for each position
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)

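class ModelV6(nn.Module):
    def __init__(self, params):
        super().__init__()

        VOCAB_SIZE = params["VOCAB_SIZE"]
        NUM_CLASSES = params["NUM_CLASSES"]
        # EMBEDDING_DIM and ENCODER_LAYERS are used below but their lookups were
        # missing from the post; reading them from params is an assumption
        EMBEDDING_DIM = params["EMBEDDING_DIM"]
        ENCODER_LAYERS = params["ENCODER_LAYERS"]
        HIDDEN_DIM = params["HIDDEN_DIM"]
        ATT_HEADS = params["ATT_HEADS"]
        DROPOUT = params["DROPOUT"]

        self.model_type = 'Transformer'
        self.pos_encoder = PositionalEncoding(EMBEDDING_DIM, DROPOUT)
        # The arguments to this call were cut off in the post; the order
        # (d_model, nhead, dim_feedforward, dropout) is a reconstruction
        encoder_layers = TransformerEncoderLayer(
            EMBEDDING_DIM, ATT_HEADS, HIDDEN_DIM, DROPOUT)
        self.transformer_encoder = TransformerEncoder(
            encoder_layers, ENCODER_LAYERS)
        self.embedding = nn.Embedding(VOCAB_SIZE, EMBEDDING_DIM)
        self.d_model = EMBEDDING_DIM
        self.linear = nn.Linear(
            in_features=EMBEDDING_DIM, out_features=NUM_CLASSES)
        self.init_weights()

    def init_weights(self) -> None:
        # Reconstructed from a garbled line: uniform init in [-initrange, initrange]
        initrange = 0.1
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.linear.weight.data.uniform_(-initrange, initrange)

    def forward(self, src: Tensor, src_mask: Tensor = None) -> Tensor:
        # src: [seq_len, batch_size] team indices, the layout PositionalEncoding expects
        src = self.embedding(src) * math.sqrt(self.d_model)
        src = self.pos_encoder(src)
        if src_mask is None:
            # The mask size was cut off in the post; one row per sequence position
            src_mask = nn.Transformer.generate_square_subsequent_mask(
                src.size(0)).to(src.device)
        output_encoded = self.transformer_encoder(src, src_mask)
        # Pass mean of encoded output into linear layer for classification.
        # With sequence-first tensors the sequence axis is dim 0 (the post had
        # dim=1, which is the sequence axis only for batch-first inputs)
        output = self.linear(output_encoded.mean(dim=0))
        return output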

Let me know if you have questions / would like to share ideas.

Hi @Matt-T,
This code looks interesting. Thanks for sharing. I think it will take me a while to parse.
I’ve been trying to follow/reproduce ideas explained in Andrej Karpathy’s video tutorial (github resources here), which I really like. My model is nowhere near as big or sophisticated as Andrej’s but it does use stacked (single-head) attention layers with the idea that, with some modification, this could ‘put the outcomes of games in the context of preceding games’. I guess this is the premise that would motivate the effort to persevere here. If the games are independent given easily-measured team-attributes then we are probably best using simpler methods. I guess lots of people have tried to model sports with big nets, but I don’t see a lot of papers on the topic. I could be looking in the wrong places though.
Best wishes, Ben.

Interesting stuff, I hadn’t seen Andrej’s video before, thanks for that. Your approach sounds good, and I couldn’t find a huge number of papers on the topic either. One thing I haven’t tried yet, at the opposite end of the spectrum, is @vdw's suggestion above of using more traditional models to achieve a better result. It’s on my list to pull together a better comparison and write-up at some stage.

Hi @Matt-T,
I have experimented quite extensively with generalized linear models for match statistics. Simple versions are easy to fit in R and do pretty well in terms of prediction - but not well enough to be profitable. My impression is that the GLMs are great at capturing the linear trends that account for the majority of the variation in outcomes but are not so good at non-linearities and interaction effects between covariates. I think that sports experts can and do take the predictions from GLMs and tweak them a bit by hand to partially account for these extra features.

It would be cool to embed a GLM in the neural net and arrange for a regularizing term that shrinks the net towards the GLM when the extra flexibility provided by additional layers is not called for… I think this could be achieved with appropriately penalized skip-layers. I will report back with any breakthroughs. Good luck with your experiments, Ben.
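
One way this idea could be sketched in PyTorch, purely as an illustration: a linear (multinomial-logit) path from inputs to logits acts as the skip layer, a small network adds a correction on top, and a penalty on the size of that correction pulls the model back towards the plain GLM. The class name, layer sizes, penalty form and penalty weight are all assumptions.

```python
import torch
from torch import nn


class GLMWithResidualNet(nn.Module):
    """Hypothetical sketch: linear (GLM-style) logits plus a penalized correction."""

    def __init__(self, num_features, num_classes=3, hidden_units=16):
        super().__init__()
        # Skip layer: on its own this is a multinomial logistic regression
        self.glm = nn.Linear(num_features, num_classes)
        # Extra flexibility for non-linearities and interactions
        self.residual = nn.Sequential(
            nn.Linear(num_features, hidden_units),
            nn.Tanh(),
            nn.Linear(hidden_units, num_classes),
        )

    def forward(self, x):
        correction = self.residual(x)
        return self.glm(x) + correction, correction


torch.manual_seed(0)
model = GLMWithResidualNet(num_features=5)
x = torch.randn(8, 5)          # made-up covariates
y = torch.randint(0, 3, (8,))  # made-up outcomes

logits, correction = model(x)
penalty_weight = 0.1  # arbitrary; larger values shrink harder towards the GLM
loss = nn.functional.cross_entropy(logits, y) \
    + penalty_weight * correction.pow(2).mean()
loss.backward()
print(f"loss: {loss.item():.4f}")
```

As the penalty weight grows, the correction is driven towards zero and the predictions approach those of the linear part alone.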