The Predictive Playbook – From Clean Data to First Predictions


📊 The data is ready — now it’s time to put it to work.
In the last post, I walked through how I scraped and cleaned 15 years of NFL stats into a structured dataset. That gave me a foundation to build on.

This time, I’m taking the first step into modeling: training machine learning algorithms to predict fantasy football points.


🧩 Feature Engineering: Turning Stats into Signals

Raw stats are helpful, but machine learning models perform best when we feed them richer features. From the cleaned dataset, I engineered new variables designed to capture both volume and efficiency:

  • Passing Yards per Attempt – efficiency for quarterbacks
  • Rushing Yards per Attempt – efficiency for running backs
  • Receiving Yards per Reception – efficiency for wide receivers and tight ends
  • Total Touches – combined rushing attempts + receptions (volume)
  • Touchdowns per Touch – scoring efficiency
  • Fumbles per Touch – risk adjustment
  • Changed Teams – indicator for players who switched teams between seasons
  • Position Indicators – one-hot encoding for QB, RB, WR, TE

These, combined with standard box score stats (yards, touchdowns, games played, etc.), gave the model a wide lens on each player’s performance profile.
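As a sketch of how features like these might be derived with pandas (the column names here are assumptions for illustration, not the actual schema of the cleaned dataset):

```python
import pandas as pd

# Hypothetical column names -- the real cleaned dataset may differ.
df = pd.DataFrame({
    "pass_yds": [4306, 0], "pass_att": [567, 0],
    "rush_yds": [523, 2005], "rush_att": [102, 345],
    "rec_yds": [0, 278], "receptions": [0, 33],
    "tds": [43, 15], "fumbles": [2, 1],
    "position": ["QB", "RB"],
})

def safe_div(num, den):
    # Guard against 0/0 (NaN) and x/0 (inf) for players with no attempts
    return (num / den).fillna(0).replace([float("inf")], 0)

# Efficiency metrics
df["pass_yds_per_att"] = safe_div(df["pass_yds"], df["pass_att"])
df["rush_yds_per_att"] = safe_div(df["rush_yds"], df["rush_att"])
df["rec_yds_per_rec"] = safe_div(df["rec_yds"], df["receptions"])

# Volume and risk metrics
df["total_touches"] = df["rush_att"] + df["receptions"]
df["tds_per_touch"] = safe_div(df["tds"], df["total_touches"])
df["fumbles_per_touch"] = safe_div(df["fumbles"], df["total_touches"])

# One-hot encode position into indicator columns (pos_QB, pos_RB, ...)
df = pd.get_dummies(df, columns=["position"], prefix="pos")
```

The `safe_div` helper matters more than it looks: players with zero attempts or touches would otherwise produce NaN or infinite ratios and break model training.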


🧠 Trying Different Models

To see which algorithms best fit this problem, I tested four approaches:

  1. Linear Regression – a simple baseline that captures linear relationships.
  2. K-Nearest Neighbors (KNN) – compares players to “similar” players.
  3. Random Forest – an ensemble of decision trees that captures complex interactions.
  4. XGBoost – a gradient boosting method that often excels on structured tabular data.

📊 Comparing Model Performance

I trained each model on historical data and evaluated its predictions against actual fantasy points using R² (goodness of fit) and MAE (mean absolute error):

| Model | R² (higher better) | MAE (lower better) | Notes |
|---|---|---|---|
| Linear Regression | 0.42 | ~52 pts | Captures basic trends, misses complexity |
| KNN | 0.45 | ~50 pts | Finds patterns in “similar” seasons but inconsistent |
| Random Forest | 0.61 | ~42 pts | Strong balance of accuracy + interpretability |
| XGBoost | 0.64 | ~40 pts | Best raw performance, but more tuning required |

The clear takeaway: tree-based models (Random Forest, XGBoost) consistently outperformed the simpler methods.


🔑 Feature Importance

One of the strengths of tree-based models like Random Forest is interpretability. By looking at feature importance, we can see which stats most influence predictions.

📊 Figure 1: Top 15 features ranked by importance in the Random Forest model

Not surprisingly, last season’s fantasy points, total touches, and efficiency metrics drive much of the signal.
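Extracting a ranking like Figure 1 is a one-liner on a fitted Random Forest via its `feature_importances_` attribute. This sketch uses toy data where the first two (hypothetical) features are genuinely predictive, so they should surface at the top:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy data with named features (the real feature set is the engineered dataset)
rng = np.random.default_rng(0)
feature_names = ["prev_fantasy_pts", "total_touches", "tds_per_touch",
                 "pass_yds_per_att", "rush_yds_per_att", "fumbles_per_touch"]
X = rng.normal(size=(400, len(feature_names)))
# Make the first two features predictive, the rest noise
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.3, size=400)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Importances are normalized to sum to 1; sort to get the ranked list
importances = (pd.Series(model.feature_importances_, index=feature_names)
                 .sort_values(ascending=False))
print(importances.head(15))
```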


📈 Actual vs. Predicted Fantasy Points

How well does the model line up with reality? Below is a scatter plot comparing predicted vs. actual fantasy points for the test set.

📊 Figure 2: Actual vs. predicted fantasy points (Random Forest model)

We can see that the model does a decent job of capturing overall trends, even with natural variation and a few outliers.
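A plot like Figure 2 can be produced with matplotlib; the key element is the y = x reference line, since points on that line are perfect predictions. The data here is a synthetic stand-in for the test-set predictions:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for script use
import matplotlib.pyplot as plt
import numpy as np

# Stand-in values; in practice these come from model.predict on the test set
rng = np.random.default_rng(1)
actual = rng.uniform(0, 350, size=200)               # season fantasy points
predicted = actual + rng.normal(scale=40, size=200)  # error roughly at MAE scale

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(actual, predicted, alpha=0.5)
lims = [0, 400]
ax.plot(lims, lims, "r--", label="perfect prediction")  # y = x reference line
ax.set_xlabel("Actual fantasy points")
ax.set_ylabel("Predicted fantasy points")
ax.set_title("Actual vs. predicted fantasy points")
ax.legend()
fig.savefig("actual_vs_predicted.png", dpi=100)
```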


📉 Residuals: Where the Model Misses

Finally, it’s worth looking at the residuals (errors) — the difference between actual and predicted fantasy points.

📊 Figure 3: Distribution of residuals (prediction errors)

The majority of errors are within a reasonable range, but a few players swing well above or below expectations — often explained by injuries, breakout seasons, or unique circumstances that stats alone can’t fully capture.


📈 First Predictions for 2025

After training, I applied the models to the 2024 stats to generate projections for 2025.
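The projection step itself is mechanically simple: fit on historical player-seasons, then run the 2024 feature rows through `predict` and sort. A minimal sketch with synthetic data and illustrative placeholder player names (not the actual dataset or projections):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic historical training data: 300 player-seasons, 5 features
rng = np.random.default_rng(7)
X_hist = rng.normal(size=(300, 5))
y_hist = X_hist @ np.array([3.0, 2.0, 1.0, 0.5, 0.1]) + rng.normal(scale=0.3, size=300)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_hist, y_hist)

# 2024 feature rows for a few illustrative players
players_2024 = pd.DataFrame(
    rng.normal(size=(4, 5)),
    index=["Player A", "Player B", "Player C", "Player D"],
)

# Project 2025 points and rank highest-first
projections = pd.Series(model.predict(players_2024.values),
                        index=players_2024.index)
rankings = projections.sort_values(ascending=False)
print(rankings)
```

One subtlety worth flagging: the features for each training row must come from season N while the target comes from season N+1, so the model learns to project forward rather than explain the same season.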

Some highlights from the Random Forest model:

  • Josh Allen (BUF, QB) – projected QB1 overall
  • Lamar Jackson (BAL, QB) – QB2
  • Saquon Barkley (PHI, RB) – RB1
  • Ja’Marr Chase (CIN, WR) – WR1
  • Surprise: Baker Mayfield (TB, QB) cracked the top 5

These predictions aren’t final rankings — they’re a first iteration, but they already look competitive with industry projections.


📝 What I Learned

  1. Feature engineering matters – volume + efficiency stats improved accuracy noticeably.
  2. Random Forest is a great balance – interpretable, strong accuracy, and robust.
  3. XGBoost may be the future – slightly better results, but requires careful tuning to avoid overfitting.

🚀 What’s Next

This is just the beginning. In the next post, I’ll refine the models further — tuning hyperparameters, comparing Random Forest with XGBoost head-to-head, and exploring how the models handle in-season weekly predictions.

The ultimate goal: predictions that aren’t just accurate, but actionable for drafts, trades, and weekly start/sit decisions.


🔗 Code and notebooks: GitHub Repo
🔗 Previous post: Cleaning the Data
