The Predictive Playbook – From Clean Data to First Predictions


📊 The data is ready — now it’s time to put it to work.
In the last post, I walked through how I scraped and cleaned 15 years of NFL stats into a structured dataset. That gave me a foundation to build on.

This time, I’m taking the first step into modeling: training machine learning algorithms to predict fantasy football points.


🧩 Feature Engineering: Turning Stats into Signals

Raw stats are helpful, but machine learning models perform best when we feed them richer features. From the cleaned dataset, I engineered new variables designed to capture both volume and efficiency:

  • Passing Yards per Attempt – efficiency for quarterbacks
  • Rushing Yards per Attempt – efficiency for running backs
  • Receiving Yards per Reception – efficiency for wide receivers and tight ends
  • Total Touches – combined rushing attempts + receptions (volume)
  • Touchdowns per Touch – scoring efficiency
  • Fumbles per Touch – risk adjustment
  • Changed Teams – indicator for players who switched teams between seasons
  • Position Indicators – one-hot encoding for QB, RB, WR, TE

These, combined with standard box score stats (yards, touchdowns, games played, etc.), gave the model a wide lens on each player’s performance profile.
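As a sketch of how features like these might be derived with pandas (the column names here are assumptions for illustration, not the actual schema of the cleaned dataset):

```python
import pandas as pd

# Hypothetical column names -- the real cleaned dataset may differ.
df = pd.DataFrame({
    "pass_yds": [4306, 0], "pass_att": [567, 0],
    "rush_yds": [523, 2005], "rush_att": [102, 345],
    "rec_yds": [0, 278], "receptions": [0, 33],
    "tds": [43, 15], "fumbles": [2, 1],
    "position": ["QB", "RB"],
})

def safe_div(num, den):
    # Guard against 0/0 (NaN) and x/0 (inf) for players with no attempts
    return (num / den).fillna(0).replace([float("inf")], 0)

# Efficiency metrics
df["pass_yds_per_att"] = safe_div(df["pass_yds"], df["pass_att"])
df["rush_yds_per_att"] = safe_div(df["rush_yds"], df["rush_att"])
df["rec_yds_per_rec"] = safe_div(df["rec_yds"], df["receptions"])

# Volume and risk metrics
df["total_touches"] = df["rush_att"] + df["receptions"]
df["tds_per_touch"] = safe_div(df["tds"], df["total_touches"])
df["fumbles_per_touch"] = safe_div(df["fumbles"], df["total_touches"])

# One-hot encode position into indicator columns (pos_QB, pos_RB, ...)
df = pd.get_dummies(df, columns=["position"], prefix="pos")
```

The `safe_div` helper matters more than it looks: players with zero attempts or touches would otherwise produce NaN or infinite ratios and break model training.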


🧠 Trying Different Models

To see which algorithms best fit this problem, I tested four approaches:

  1. Linear Regression – a simple baseline that captures linear relationships.
  2. K-Nearest Neighbors (KNN) – compares players to “similar” players.
  3. Random Forest – an ensemble of decision trees that captures complex interactions.
  4. XGBoost – a gradient boosting method that often excels on structured tabular data.

📊 Comparing Model Performance

I trained each model on historical data and evaluated its predictions against actual fantasy points using R² (goodness of fit) and MAE (mean absolute error):

| Model | R² (higher better) | MAE (lower better) | Notes |
|---|---|---|---|
| Linear Regression | 0.42 | ~52 pts | Captures basic trends, misses complexity |
| KNN | 0.45 | ~50 pts | Finds patterns in “similar” seasons but inconsistent |
| Random Forest | 0.61 | ~42 pts | Strong balance of accuracy + interpretability |
| XGBoost | 0.64 | ~40 pts | Best raw performance, but more tuning required |

The clear takeaway: tree-based models (Random Forest, XGBoost) consistently outperformed the simpler methods.


🔑 Feature Importance

One of the strengths of tree-based models like Random Forest is interpretability. By looking at feature importance, we can see which stats most influence predictions.

📊 Figure 1: Top 15 features ranked by importance in the Random Forest model

Not surprisingly, last season’s fantasy points, total touches, and efficiency metrics drive much of the signal.
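Extracting a ranking like Figure 1 is a one-liner on a fitted Random Forest via its `feature_importances_` attribute. This sketch uses toy data where the first two (hypothetical) features are genuinely predictive, so they should surface at the top:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy data with named features (the real feature set is the engineered dataset)
rng = np.random.default_rng(0)
feature_names = ["prev_fantasy_pts", "total_touches", "tds_per_touch",
                 "pass_yds_per_att", "rush_yds_per_att", "fumbles_per_touch"]
X = rng.normal(size=(400, len(feature_names)))
# Make the first two features predictive, the rest noise
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.3, size=400)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Importances are normalized to sum to 1; sort to get the ranked list
importances = (pd.Series(model.feature_importances_, index=feature_names)
                 .sort_values(ascending=False))
print(importances.head(15))
```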


📈 Actual vs. Predicted Fantasy Points

How well does the model line up with reality? Below is a scatter plot comparing predicted vs. actual fantasy points for the test set.

📊 Figure 2: Actual vs. predicted fantasy points (Random Forest model)

We can see that the model does a decent job of capturing overall trends, even with natural variation and a few outliers.
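A plot like Figure 2 can be produced with matplotlib; the key element is the y = x reference line, since points on that line are perfect predictions. The data here is a synthetic stand-in for the test-set predictions:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for script use
import matplotlib.pyplot as plt
import numpy as np

# Stand-in values; in practice these come from model.predict on the test set
rng = np.random.default_rng(1)
actual = rng.uniform(0, 350, size=200)               # season fantasy points
predicted = actual + rng.normal(scale=40, size=200)  # error roughly at MAE scale

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(actual, predicted, alpha=0.5)
lims = [0, 400]
ax.plot(lims, lims, "r--", label="perfect prediction")  # y = x reference line
ax.set_xlabel("Actual fantasy points")
ax.set_ylabel("Predicted fantasy points")
ax.set_title("Actual vs. predicted fantasy points")
ax.legend()
fig.savefig("actual_vs_predicted.png", dpi=100)
```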


📉 Residuals: Where the Model Misses

Finally, it’s worth looking at the residuals (errors) — the difference between actual and predicted fantasy points.

📊 Figure 3: Distribution of residuals (prediction errors)

The majority of errors are within a reasonable range, but a few players swing well above or below expectations — often explained by injuries, breakout seasons, or unique circumstances that stats alone can’t fully capture.


📈 First Predictions for 2025

After training, I applied the models to the 2024 stats to generate projections for 2025.
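The projection step itself is mechanically simple: fit on historical player-seasons, then run the 2024 feature rows through `predict` and sort. A minimal sketch with synthetic data and illustrative placeholder player names (not the actual dataset or projections):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic historical training data: 300 player-seasons, 5 features
rng = np.random.default_rng(7)
X_hist = rng.normal(size=(300, 5))
y_hist = X_hist @ np.array([3.0, 2.0, 1.0, 0.5, 0.1]) + rng.normal(scale=0.3, size=300)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_hist, y_hist)

# 2024 feature rows for a few illustrative players
players_2024 = pd.DataFrame(
    rng.normal(size=(4, 5)),
    index=["Player A", "Player B", "Player C", "Player D"],
)

# Project 2025 points and rank highest-first
projections = pd.Series(model.predict(players_2024.values),
                        index=players_2024.index)
rankings = projections.sort_values(ascending=False)
print(rankings)
```

One subtlety worth flagging: the features for each training row must come from season N while the target comes from season N+1, so the model learns to project forward rather than explain the same season.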

Some highlights from the Random Forest model:

  • Josh Allen (BUF, QB) – projected QB1 overall
  • Lamar Jackson (BAL, QB) – QB2
  • Saquon Barkley (PHI, RB) – RB1
  • Ja’Marr Chase (CIN, WR) – WR1
  • Surprise: Baker Mayfield (TB, QB) cracked the top 5

These predictions aren’t final rankings — they’re a first iteration, but they already look competitive with industry projections.


📝 What I Learned

  1. Feature engineering matters – volume + efficiency stats improved accuracy noticeably.
  2. Random Forest is a great balance – interpretable, strong accuracy, and robust.
  3. XGBoost may be the future – slightly better results, but requires careful tuning to avoid overfitting.

🚀 What’s Next

This is just the beginning. In the next post, I’ll refine the models further — tuning hyperparameters, comparing Random Forest with XGBoost head-to-head, and exploring how the models handle in-season weekly predictions.

The ultimate goal: predictions that aren’t just accurate, but actionable for drafts, trades, and weekly start/sit decisions.


🔗 Code and notebooks: GitHub Repo
🔗 Previous post: Cleaning the Data
