Introduction:
Sports betting has become a significant industry in the U.S., and it is growing rapidly. In 2023 alone, Americans wagered a record-breaking $119.84 billion on sports betting, according to the American Gaming Association. As sports betting continues to expand in popularity, particularly with increasing legalization across states, that figure is expected to rise further in the coming years.
When you think about it, sports betting is essentially applied data science. Bettors and odds-makers alike analyze vast amounts of data — player performance, team stats, injury reports, and other factors — to make predictions about the outcome of games. The objective of this project is to determine whether a simple supervised machine learning model, specifically using linear regression, can effectively predict the passing yards of NFL quarterbacks. By leveraging historical data, we aim to see if our model can arrive at conclusions similar to those of professional oddsmakers in Las Vegas.
If you have been following this project since the beginning, you know we started by tracking Cowboys quarterback Dak Prescott. However, shortly after that, Dak suffered a season-ending injury, and big changes had to be made.
The two QBs with the next-longest tenures as starters for a single team are Patrick Mahomes and Josh Allen. I also added Aaron Rodgers, since he is by far the largest outlier among all starting QBs, and finally C.J. Stroud and Jayden Daniels to see how the model handles limited data.
Data:
Code:
Since this new version of the project works with multiple QBs, and I eventually want to scale it to every NFL starting QB, I built the model to be modular.
Here is how it works: each QB's career dataset is used to train its own model. Currently the model is an Elastic Net, which combines Ridge (L2 regularization) and Lasso (L1 regularization). The model takes a parameter called l1_ratio that controls how much each penalty influences the overall model: a value of 1 is pure L1 and a value of 0 is pure L2. For now we will use a ratio of 0.5. The nice thing about L1 regularization is its pseudo built-in feature selection: if the model sees no correlation between a feature and the predicted outcome, it sets that feature's weight to zero. That works well here, where every QB's data has the same structure but produces different outcomes due to play style and experience.
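To make that concrete, here is a rough sketch of what the per-QB training step could look like with scikit-learn. The column names, file name, and alpha value are placeholders I am assuming for illustration, not the project's exact code:

import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ["age", "cmp", "att", "td", "int", "sacks"]  # assumed column names
TARGET = "yards"                                        # passing yards

def train_qb_model(career_df, l1_ratio=0.5):
    """Fit an Elastic Net on a single QB's career game log."""
    X = career_df[FEATURES]
    y = career_df[TARGET]
    # Scale features first so the L1/L2 penalties don't favor large-valued stats
    model = make_pipeline(StandardScaler(), ElasticNet(alpha=1.0, l1_ratio=l1_ratio))
    model.fit(X, y)
    return model

# Illustrative usage; the file name is hypothetical
mahomes = train_qb_model(pd.read_csv("mahomes_career.csv"))

Because each model is trained from a single function call on one QB's dataset, scaling this out to every starting QB is just a matter of looping over more career files.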
Once the model is trained, we can make our prediction for the next game, with passing yards as our target. To do that, we need inputs, basically guesses, to feed to the model.
Inputs Module:
class Stats:
    def __init__(self, age, completions, attempts, passingYards, passingTDs, interceptions, sacks):
        self.age = age              # QB age at game time
        self.cmp = completions      # completions
        self.att = attempts         # pass attempts
        self.yards = passingYards   # passing yards
        self.td = passingTDs        # passing touchdowns
        self.int = interceptions    # interceptions thrown
        self.sacks = sacks          # times sacked
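To tie it together, here is a rough sketch of how a guessed stat line could be fed to the trained model from the training sketch above. The feature ordering and the prediction call are my assumptions for illustration; passing yards is left out of the inputs since that is what we are predicting:

import pandas as pd

# Hypothetical guess at next week's stat line
# (passingYards is the target, so it is set to 0 and never used as an input)
guess = Stats(age=29, completions=25, attempts=36, passingYards=0,
              passingTDs=2, interceptions=1, sacks=2)

# Build a one-row frame whose columns match the training features (assumed order)
row = pd.DataFrame([{
    "age": guess.age,
    "cmp": guess.cmp,
    "att": guess.att,
    "td": guess.td,
    "int": guess.int,
    "sacks": guess.sacks,
}])

predicted_yards = mahomes.predict(row)[0]  # model from the training sketch above
print(f"Projected passing yards: {predicted_yards:.1f}")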