Help improving sports prediction model

Hi there,

:rotating_light: I’m very much beginning my journey into PyTorch, and thought I’d reach out for advice and suggested improvements.

I’m playing with a model that predicts football (soccer) matches. Raw data is in this CSV format:

season,date,home_team,away_team,home_goals,away_goals,result
2019,2019-08-09,Liverpool,Norwich,4,1,H

Full dataset here

For the ‘result’ column:

H: Home team won
A: Away team won
D: The match was a draw

I can feed my model a home and away team (each converted to a unique int), and have it predict the result of a match as an int (H: 2 / A: 1 / D: 0).

But after training for a while it’s not that effective: I can see the loss going down to about 0.49, but I can’t seem to reduce it any further.

Is this just the nature of sports data, or am I introducing any bad practices in my code? Any tips and guidance on this kind of project would be greatly appreciated. :pray:
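
To put a loss figure like 0.49 in context, it can help to compare it against a trivial baseline. The sketch below is only an illustration: it uses a small made-up list of labels (not the real dataset) and a hypothetical helper, `prior_baseline_loss`, to compute the cross-entropy of always predicting the class frequencies. For three equally likely outcomes that value is ln(3) ≈ 1.0986. A training loss well below such a baseline, with only team IDs as inputs, may partly reflect memorising the training set, so it is worth checking against the test loss.

```python
import math
from collections import Counter


def prior_baseline_loss(labels):
    """Cross-entropy (in nats) of always predicting the empirical class frequencies."""
    counts = Counter(labels)
    total = len(labels)
    # The mean of -log p(label) over the data equals the entropy of the label distribution
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Hypothetical labels using the same encoding as above (H: 2 / A: 1 / D: 0)
example_labels = [2, 2, 1, 0, 2, 1, 2, 0, 1, 2]
baseline = prior_baseline_loss(example_labels)
uniform = math.log(3)  # loss of guessing all three outcomes with equal probability
print(f"Class-prior baseline: {baseline:.4f} | Uniform guess: {uniform:.4f}")
```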

import matplotlib.pyplot as plt
import numpy as np
import pandas
import torch
from torch import nn
from sklearn.model_selection import train_test_split


def get_data():
    csv = pandas.read_csv('./data.csv')
    data = csv.drop(
        columns=['season', 'home_goals', 'away_goals'])
    return data


def get_teams():
    # Combine home and away team names, get unique cases + optionally sort
    teams_unique = pandas.concat(
        [data['home_team'], data['away_team']]).unique()
    teams_sorted = np.sort(teams_unique)
    teams = dict(zip(teams_sorted, range(len(teams_sorted))))
    return teams


# Build dictionary
data = get_data()
teams = get_teams()


def get_team(team_str="Arsenal"):
    # Look up a team's integer index (not a one-hot vector), for use now and later when predicting
    return teams[team_str]


# Features / teams as ints
data_features = []
for r in data.itertuples():
    data_features.append([get_team(r.home_team), get_team(r.away_team)])

for r in data_features[:10]:
    print(list(teams.keys())[r[0]], "vs", list(teams.keys())[r[1]])


# Scores
data_scores = []
for r in data[["result"]].itertuples():
    result = r.result
    res = 0
    if result == "H":
        res = 2
    elif result == "A":
        res = 1
    else:
        res = 0
    data_scores.append(res)


# Split the data into training and testing sets
RANDOM_SEED = 42
X = torch.tensor(data_features, dtype=torch.float32)
y = torch.tensor(data_scores, dtype=torch.int64)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_SEED)

print(
    f"X_train: {X_train.shape}: {X_train.dtype} | y_train: {y_train.shape}: {y_train.dtype}")
print(
    f"X_test: {X_test.shape}: {X_test.dtype} | y_test: {y_test.shape}: {y_test.dtype}")


# Build the model

class ModelV1(nn.Module):
    def __init__(self, INPUT_FEATURES=2, OUTPUT_FEATURES=3, HIDDEN_UNITS=8):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_features=INPUT_FEATURES,
                      out_features=HIDDEN_UNITS),
            nn.Sigmoid(),
            nn.Linear(in_features=HIDDEN_UNITS, out_features=HIDDEN_UNITS),
            nn.Sigmoid(),
            nn.Linear(in_features=HIDDEN_UNITS,
                      out_features=OUTPUT_FEATURES)
        )

    def forward(self, x):
        return self.layers(x)


INPUT_FEATURES = X_train.shape[1]
HIDDEN_UNITS = len(teams) * 4
OUTPUT_FEATURES = 3  # one logit per class: D = 0, A = 1, H = 2

model = ModelV1(INPUT_FEATURES, OUTPUT_FEATURES, HIDDEN_UNITS)

# Loss
loss_fn = nn.CrossEntropyLoss()

# Optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Accuracy


def accuracy_fn(outputs, targets):
    correct = torch.sum(outputs == targets).item()
    acc = (correct/len(outputs)) * 100
    return acc


# Prepare
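# torch.manual_seed seeds the RNG on all devices; call it before creating the
# model if the weight initialisation should be reproducible as well
torch.manual_seed(RANDOM_SEED)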

# Set no of epochs
EPOCHS = 1000
print_steps = round(EPOCHS / 100)
losses = []
for epoch in range(EPOCHS):
    # Training step
    model.train()
    y_logits = model(X_train)
    # Take the argmax over the class dimension (dim=1); softmax does not change it
    outputs = y_logits.argmax(dim=1)
    loss = loss_fn(y_logits, y_train)
    acc = accuracy_fn(outputs, y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

    # Evaluation step on the held-out test set
    model.eval()
    with torch.inference_mode():
        test_logits = model(X_test)
        outputs = test_logits.argmax(dim=1)
        test_loss = loss_fn(test_logits, y_test)
        test_acc = accuracy_fn(outputs, y_test)

    if epoch % print_steps == 0:
        print(f"Epoch: {epoch+1}/{EPOCHS} | Loss: {loss:.5f} | Acc: {acc:.2f}% | "
              f"Test loss: {test_loss:.5f} | Test acc: {test_acc:.2f}%")

# Compare results

print_steps = round(len(outputs) / 10)
correct = 0

for i, o in enumerate(outputs):
    is_correct = y_test[i].item() == o.item()
    icon = "✅" if is_correct else "❌"
    correct += 1 if is_correct else 0
    if i % print_steps == 0:
        print(
            f"{icon} Actual: {y_test[i].item():.2f} | Predicted: {o.item():.2f}")
print("-" * 30)
print(f"Correct: {correct} / {len(outputs)}")
print(f"Accuracy: {correct/len(outputs)*100:.2f}%")


# Plot training loss (only the training loss is recorded in `losses`)
plt.plot(range(EPOCHS), losses, label="Training Loss")
plt.legend(prop={'size': 12})
plt.show()

The posted raw data does not seem to contain any features besides the outcome from past games. Could you explain what you expect the model to learn from this data?

:wave: Hey @ptrblck, I’m hoping (naively perhaps!) that by giving the model the history of home and away teams, along with past results (Home team won / Away team won / Draw), it can learn which team combinations tend to result in which match outcomes. A few examples of potential learning:

  • Which teams tend to win against other teams generally
  • When teams tend to beat teams at home (indicates strong home advantage)
  • When one team happens to be really good at beating another specific team either home or away (perhaps due to their playing tactics)

Would adding more features (home / away goals, other match statistics potentially) to the input tensors be a way of reducing loss? Ideally after training, I would only want to provide two teams (home and away) as inputs to the model for it to predict the result.

@ptrblck sorry for the nudge, but any thoughts here on my last reply?

  1. Tanh may make for a better activation layer than Sigmoid for intermediate layers.
  2. Conv1d or a TransformerEncoder may provide better results, as games further away in time may have less impact on the outcome. Structure the data so that input dims are something like [ batch_size, num_game_season, (win/tie/loss, score ratio)]
  3. You could encode the results of past games with Win = 1.0, Tie = 0.5, Loss = 0.0 for inputs and probability distribution for outputs.
  4. Dropout on the intermediate layers may help. TransformerEncoder can be set with the dropout argument.
  5. A score ratio of loser/winner scores could be added as a second channel, using 0.5 for a tie (that will prevent a divide by zero in the case of 0 / 0).
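
As a rough illustration of points 3 and 5, the sketch below encodes one past match from a single team's point of view. The function name `encode_result` and the two-value layout (outcome, score ratio) are assumptions made for this example, not an established format; the 4-1 score is the Liverpool vs Norwich row from the original post.

```python
def encode_result(team_goals, opponent_goals):
    """Return (outcome, score_ratio) for one match, seen from one team's side.

    outcome:     Win = 1.0, Tie = 0.5, Loss = 0.0
    score_ratio: loser goals / winner goals, or 0.5 for a tie
    """
    if team_goals > opponent_goals:
        outcome = 1.0
    elif team_goals < opponent_goals:
        outcome = 0.0
    else:
        outcome = 0.5

    if team_goals == opponent_goals:
        # Covers 0-0 as well, so there is never a division by zero
        score_ratio = 0.5
    else:
        # The winner always has at least one goal, so the denominator is non-zero
        score_ratio = min(team_goals, opponent_goals) / max(team_goals, opponent_goals)
    return outcome, score_ratio


# Liverpool 4-1 Norwich, encoded for each side
liverpool_view = encode_result(4, 1)
norwich_view = encode_result(1, 4)
print(liverpool_view, norwich_view)  # (1.0, 0.25) (0.0, 0.25)
```

Stacking these pairs over a team's recent games would give the [batch_size, num_game_season, 2] layout mentioned in point 2.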

Thanks very much! Will give these a go. :pray:

More as a side note: Did you also try more traditional models (e.g., Decision Trees, Random Forests, Gradient Boosted Trees)? I wouldn’t be surprised if those work better for your type of structured data. At least this is my observation from my course projects (classification or regression tasks over structured data): neural network-based models never come out on top when teams compare different methods.
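
For comparison, a minimal sketch of such a baseline with scikit-learn might look like the following. It assumes the same `home_team`, `away_team` and `result` columns as the CSV in the original post; apart from the first row, the rows below are made up for illustration, and the hyperparameters are arbitrary starting points rather than tuned values.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy rows in the same column layout as the CSV (mostly made up, not real results)
matches = pd.DataFrame({
    "home_team": ["Liverpool", "Norwich", "Arsenal", "Liverpool", "Arsenal", "Norwich"],
    "away_team": ["Norwich", "Arsenal", "Liverpool", "Arsenal", "Norwich", "Liverpool"],
    "result":    ["H", "A", "D", "H", "H", "A"],
})

features = matches[["home_team", "away_team"]]
labels = matches["result"]

# One-hot encode the team names so no artificial ordering is implied by integer IDs
forest = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
forest.fit(features, labels)

# Predict class probabilities for a single fixture
fixture = pd.DataFrame({"home_team": ["Liverpool"], "away_team": ["Norwich"]})
probabilities = forest.predict_proba(fixture)[0]
print(dict(zip(forest.classes_, probabilities.round(2))))
```

On the real dataset the same pipeline could be evaluated with the existing `train_test_split` setup, which would make its accuracy directly comparable with the neural network's.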

Thanks @vdw, I haven’t tried traditional models like the ones you mention. I’ll look into trying them and comparing their results against the above soon. :slight_smile:

@J_Johnson me again, I’m slowly working my way through your and @vdw’s options here to understand what will work best.

I’m still learning about Transformers. For the TransformerEncoder approach you mentioned, would you be able to share a rough example of what a TransformerEncoder model itself might look like in this instance? I’ve started a thread here as I’m having trouble adapting the PyTorch example to a classification task like this. Happy to reward you for your time. :pray:

In the case of a language model using a TransformerEncoder, you have the following:

First, there is an embedding layer. This takes every word and vectorizes it, giving it some sort of semantic meaning. Usually, the input has size [batch_size, sequence_length] and the output has size [batch_size, sequence_length, embedding_dim] (assuming batch_first = True). The embedding dim contains qualitative information about the meaning of each word. Perhaps a certain dimension corresponds to hot/cold, where coffee might score 0.2 and iced tea might score 0.8. The data and training establish these values and meanings.

Likewise, each team may have some qualitative value. Perhaps this team scores higher on defense and strategy, but lower on speed and accuracy. Anyway, the model can learn its own internal qualitative representation of each team during training.
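
A minimal sketch of that idea, without any Transformer yet, could look like this. It assumes the team IDs are the integer indices from the `teams` dictionary in the original post; the class name, the separate home/away embedding tables (so that a home advantage can be learned) and all sizes are illustrative choices, not a recommended configuration. Tanh and Dropout follow the suggestions made earlier in the thread.

```python
import torch
from torch import nn


class TeamEmbeddingModel(nn.Module):
    def __init__(self, num_teams, embed_dim=8, hidden_units=32,
                 num_outcomes=3, dropout=0.2):
        super().__init__()
        # One learned vector per team, separately for the home and away roles
        self.home_embedding = nn.Embedding(num_teams, embed_dim)
        self.away_embedding = nn.Embedding(num_teams, embed_dim)
        self.classifier = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_units),
            nn.Tanh(),
            nn.Dropout(dropout),
            nn.Linear(hidden_units, num_outcomes),
        )

    def forward(self, team_ids):
        # team_ids: int64 tensor of shape [batch_size, 2] holding (home_id, away_id)
        home = self.home_embedding(team_ids[:, 0])
        away = self.away_embedding(team_ids[:, 1])
        # Returns raw logits of shape [batch_size, num_outcomes]
        return self.classifier(torch.cat([home, away], dim=1))


model = TeamEmbeddingModel(num_teams=20)
fixtures = torch.tensor([[3, 7], [7, 3]])  # hypothetical team ids
logits = model(fixtures)
print(logits.shape)  # torch.Size([2, 3])
```

The logits can be passed straight to `nn.CrossEntropyLoss`, and the model only needs the two team IDs at prediction time. Note that the inputs must be int64 IDs rather than the float32 tensor used in the original code.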

Those embeddings then enter the TransformerEncoder, which takes something like [batch_size, sequence_length, embedding_dim]. You want some history of who is playing whom. For example, your input sequence might be in pairs (home_0, away_0, home_1, away_1, …), where each of those is a team and the pairs are ordered from the oldest to the most recent match. In language models, the transformer layer is used to attend to what the next token prediction may be. In this case, though, you would not be predicting the next token, but the outcome for the last two teams in the sequence, e.g. Cowboys vs. the Eagles.

In a language model, you then have a linear layer which predicts the next token by giving out a probability score for every possible word, something of size [batch_size, all possible words]. And then you argmax on that or use some random selection of the top 10% or … Anyway, next token selection in LLMs is irrelevant in your use case, so I’ll spare you that explanation. You would then have a final Linear layer which, instead of doing next token prediction, predicts the outcome of the game, something of size [batch_size, 3] (win/lose/tie).
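
Put together, a rough sketch of such a model might look like the following. Treat it as a starting point under several assumptions: the class name and all sizes are made up, the learned position embedding is an addition (self-attention has no notion of order on its own), and the head simply reads the encoded vectors of the last two teams in the sequence. Past results are not part of the input here; they would have to be added separately, for example as an extra embedding.

```python
import torch
from torch import nn


class MatchTransformer(nn.Module):
    def __init__(self, num_teams, embed_dim=32, num_heads=4, num_layers=2,
                 num_outcomes=3, max_len=64, dropout=0.1):
        super().__init__()
        self.team_embedding = nn.Embedding(num_teams, embed_dim)
        # Learned position embedding so the encoder can tell old matches from recent ones
        self.pos_embedding = nn.Embedding(max_len, embed_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=4 * embed_dim, dropout=dropout, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # The head sees the encoded home and away team of the most recent pair
        self.head = nn.Linear(2 * embed_dim, num_outcomes)

    def forward(self, team_ids):
        # team_ids: int64 [batch_size, seq_len], ordered home_0, away_0, home_1, away_1, ...
        positions = torch.arange(team_ids.size(1), device=team_ids.device)
        x = self.team_embedding(team_ids) + self.pos_embedding(positions)
        x = self.encoder(x)                           # [batch_size, seq_len, embed_dim]
        last_pair = x[:, -2:, :].flatten(start_dim=1)  # [batch_size, 2 * embed_dim]
        return self.head(last_pair)                   # logits: [batch_size, num_outcomes]


model = MatchTransformer(num_teams=20)
# Hypothetical batch: 4 sequences of 5 matches each (10 team ids per sequence)
team_ids = torch.randint(0, 20, (4, 10))
logits = model(team_ids)
print(logits.shape)  # torch.Size([4, 3])
```

As with the simpler models, the logits go into `nn.CrossEntropyLoss` with the result of the last match in each sequence as the target.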

Thank you so much for this thorough reply! I’ll give it a whirl soon and report back.